
Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages

Published: 09 January 2019

Abstract

This study aims to increase the performance of word embeddings by proposing a new weighting scheme for co-occurrence counting. The idea behind this new family of weights is to compensate for the disadvantage suffered, during co-occurrence counting, by word pairs that appear far apart yet are semantically close. For high-resource languages, this disadvantage may not be significant, since such pairs still co-occur frequently. However, when there are not enough available resources, such pairs suffer from being distant. To favour such pairs, a weighting scheme based on a polynomial fitting procedure is proposed that shifts the weights of distant words up while leaving the weights of nearby words almost unchanged. The parameter optimization for the new weights and the effects of the weighting scheme are analysed for English, Italian, and Turkish. A small portion of the English resources and a quarter of the Italian resources are utilized for demonstration purposes, as if these languages were low-resource languages. A performance increase is observed in analogy tests when the proposed weighting scheme is applied to these relatively small corpora (i.e., mimicking low-resource languages) of both English and Italian. To show that the effectiveness of the proposed scheme is specific to small corpora, it is also shown that the proposed weighting scheme cannot outperform the original weights on a large English corpus. Since Turkish is a relatively low-resource language, it is demonstrated that the proposed weighting scheme increases the performance of both analogy and similarity tests when all Turkish Wikipedia pages are utilized as a corpus. The positive effect of the proposed scheme is also demonstrated in a standard sentiment analysis task for Turkish.
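The abstract does not give the exact weighting formula, so the following is only a minimal sketch of the idea: a GloVe-style co-occurrence counter in which the standard harmonic 1/d weight can be swapped for a hypothetical distance-boosted weight that raises the contribution of distant pairs inside the context window while leaving adjacent pairs almost unchanged. The function names, the linear boost formula, and the alpha parameter are illustrative assumptions, not the polynomial fit described in the article.

```python
from collections import defaultdict

def harmonic_weight(d):
    """Standard GloVe-style weight: a pair at distance d contributes 1/d."""
    return 1.0 / d

def boosted_weight(d, window=5, alpha=0.3):
    """Hypothetical distance-boosted weight (illustrative assumption only).

    The boost is zero at d = 1 and grows with distance, so distant pairs
    count for more than under the plain 1/d weight.
    """
    boost = alpha * (d - 1) / (window - 1) if window > 1 else 0.0
    return 1.0 / d + boost

def cooccurrence_counts(tokens, window=5, weight_fn=harmonic_weight):
    """Accumulate symmetric, distance-weighted co-occurrence counts."""
    counts = defaultdict(float)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            w = weight_fn(j - i)
            counts[(tokens[i], tokens[j])] += w
            counts[(tokens[j], tokens[i])] += w
    return counts

if __name__ == "__main__":
    tokens = "the cat sat on the mat near the cat".split()
    baseline = cooccurrence_counts(tokens, window=5, weight_fn=harmonic_weight)
    boosted = cooccurrence_counts(
        tokens, window=5, weight_fn=lambda d: boosted_weight(d, window=5))
    # A distant pair such as ("sat", "mat") receives a larger count under the
    # boosted scheme, while the purely adjacent pair ("sat", "on") is unchanged.
    print(baseline[("sat", "mat")], boosted[("sat", "mat")])
    print(baseline[("sat", "on")], boosted[("sat", "on")])
```

In the article itself, the boost is obtained from a polynomial fitting procedure whose parameters are optimized per language; the linear boost above merely stands in for that fit.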



        Published In

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 18, Issue 3
        September 2019
        386 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3305347
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 09 January 2019
        Accepted: 01 September 2018
        Revised: 01 July 2018
        Received: 01 October 2017
        Published in TALLIP Volume 18, Issue 3


        Author Tags

        1. Word embeddings
        2. co-occurrence weighting
        3. computational linguistics

        Qualifiers

        • Research-article
        • Research
        • Refereed


        Cited By

        • (2023) Filtering and Extended Vocabulary based Translation for Low-resource Language Pair of Sanskrit-Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing 22(4), 1-15. https://doi.org/10.1145/3580495. Online publication date: 19-Jan-2023.
        • (2023) Impact of Tokenization on Language Models: An Analysis for Turkish. ACM Transactions on Asian and Low-Resource Language Information Processing 22(4), 1-21. https://doi.org/10.1145/3578707. Online publication date: 30-Apr-2023.
        • (2020) Multilingual Sentiment Analysis. In Deep Learning-Based Approaches for Sentiment Analysis, 193-236. https://doi.org/10.1007/978-981-15-1216-2_8. Online publication date: 25-Jan-2020.
        • (2019) Skip-Gram-KR: Korean Word Embedding for Semantic Clustering. IEEE Access 7, 39948-39961. https://doi.org/10.1109/ACCESS.2019.2905252. Online publication date: 2019.
