ABSTRACT
Word embeddings have become essential components in many information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. Despite their widespread use, however, traditional word embedding models are static: once trained, they cannot adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this limitation, incremental word embedding algorithms have been introduced; they process continuous data streams and dynamically update word representations as new language patterns appear.
This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as social media analysis. The library implements several incremental word embedding techniques, including incremental versions of Skip-gram, Continuous Bag of Words, and the Word Context Matrix, within a standardized framework, and uses PyTorch as its backend for neural network training.
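To make the training dynamic concrete, the sketch below implements a minimal incremental Skip-gram with negative sampling in PyTorch. It is an illustration written for this summary, not RiverText's actual code: the class `IncrementalSGNS` and its `update` method are hypothetical names, and a full implementation would also maintain a dynamic vocabulary and an adaptive unigram table for drawing negative samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IncrementalSGNS(nn.Module):
    """Minimal incremental Skip-gram with negative sampling (illustration only)."""

    def __init__(self, vocab_size: int, emb_size: int = 100, lr: float = 0.025):
        super().__init__()
        self.target = nn.Embedding(vocab_size, emb_size)   # center-word vectors
        self.context = nn.Embedding(vocab_size, emb_size)  # context-word vectors
        self.opt = torch.optim.SGD(self.parameters(), lr=lr)

    def update(self, center: torch.Tensor, context: torch.Tensor,
               negatives: torch.Tensor) -> float:
        """Single online update from one (center, context) pair plus k negatives."""
        v = self.target(center)          # (emb_size,)
        u_pos = self.context(context)    # (emb_size,)
        u_neg = self.context(negatives)  # (k, emb_size)
        # word2vec-style negative-sampling objective: pull the observed pair
        # together, push the sampled negatives away.
        loss = -F.logsigmoid(torch.dot(v, u_pos)) - F.logsigmoid(-(u_neg @ v)).sum()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()


# Example: one update on a single word pair with five random negative samples.
model = IncrementalSGNS(vocab_size=50_000)
loss = model.update(torch.tensor(7), torch.tensor(42),
                    torch.randint(0, 50_000, (5,)))
```

Because each call to `update` touches only the rows involved in the pair, the model can consume an unbounded stream one instance at a time, which is the property that distinguishes incremental methods from batch retraining.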
We also implemented a module that adapts existing intrinsic evaluation tasks for static word embeddings, namely word similarity and word categorization, to the streaming setting. Finally, we compare the implemented methods under different hyperparameter settings and discuss the results.
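As an illustration of how such an adaptation can work, the sketch below periodically pauses a training stream and computes the Spearman correlation between human similarity judgments and cosine similarities from the evolving vectors. This is a simplified, hypothetical rendering of the idea rather than the library's evaluation module; in particular, `learn_one`, `vectors()`, and the evaluation period are our own assumptions.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def similarity_score(vectors: dict, pairs: list) -> float:
    """Spearman correlation between human ratings and embedding cosines.

    `pairs` holds (word1, word2, human_score) triples, e.g. from MEN or
    WordSim-353; pairs with out-of-vocabulary words are skipped.
    """
    gold, pred = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            gold.append(score)
            pred.append(cosine(vectors[w1], vectors[w2]))
    return spearmanr(gold, pred).correlation


def evaluate_periodically(stream, model, pairs, period: int = 100_000):
    """Train on the stream, reporting the intrinsic score every `period` instances.

    `model.learn_one` and `model.vectors()` are hypothetical methods standing in
    for whatever online-update and vector-export interface the model exposes.
    """
    for i, instance in enumerate(stream, start=1):
        model.learn_one(instance)
        if i % period == 0:
            rho = similarity_score(model.vectors(), pairs)
            print(f"after {i:>9} instances: Spearman rho = {rho:.3f}")
```

Evaluating at fixed intervals rather than once at the end yields a curve of scores over the stream, which is what makes it possible to compare how quickly different incremental methods converge.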
Our open-source library is available at https://github.com/dccuchile/rivertext.
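For orientation, here is a minimal usage sketch in the style of the repository's documentation. The class names (`TweetStream`, `IWord2Vec`), the constructor parameters, and the `learn_many` mini-batch update are based on our reading of the project's README and may not match the current release exactly; treat the snippet as an assumption and consult the repository for the authoritative interface.

```python
from torch.utils.data import DataLoader

# Assumed imports, following the repository's README; names may differ by version.
from rivertext.models import IWord2Vec
from rivertext.utils import TweetStream

# Stream tweets from disk and feed the model in mini-batches.
stream = TweetStream("/path/to/tweets.txt")
loader = DataLoader(stream, batch_size=32)

# sg=1 selects the incremental Skip-gram variant (sg=0 would be CBOW);
# the parameter names here are assumptions, not a verified signature.
model = IWord2Vec(window_size=3, emb_size=100, sg=1, device="cpu")

for batch in loader:
    model.learn_many(batch)  # one incremental update per mini-batch
```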