
RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams

Published: 18 July 2023

Abstract

Word embeddings have become essential components in many information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models are static, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this limitation, incremental word embedding algorithms have been introduced: they process continuous data streams and dynamically update word representations as new language patterns appear.
This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training.
We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results.
Our open-source library is available at https://github.com/dccuchile/rivertext.
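
To picture the kind of incremental update these techniques perform, the following is a minimal sketch of skip-gram with negative sampling trained one instance at a time in PyTorch. It is illustrative only: the names (IncrementalSGNS, learn_one), the uniform negative sampling, and the fixed-size vocabulary table are assumptions made for the sketch and do not reflect RiverText's actual API.

```python
# Illustrative sketch only: incremental skip-gram with negative sampling
# (SGNS) updated one instance at a time over a text stream, in the spirit of
# what RiverText implements. Class/function names are hypothetical and are
# NOT the library's API.
import torch
import torch.nn as nn


class IncrementalSGNS(nn.Module):
    def __init__(self, max_vocab=100_000, dim=100):
        super().__init__()
        self.target = nn.Embedding(max_vocab, dim)   # word vectors
        self.context = nn.Embedding(max_vocab, dim)  # context vectors
        self.vocab = {}                              # word -> index, grown on the fly
        self.max_vocab = max_vocab

    def index(self, word):
        # Assign a slot to unseen words until the table is full.
        if word not in self.vocab and len(self.vocab) < self.max_vocab:
            self.vocab[word] = len(self.vocab)
        return self.vocab.get(word)

    def loss(self, t_idx, c_idx, neg_idx):
        t = self.target(t_idx)                       # (B, dim)
        c = self.context(c_idx)                      # (B, dim)
        n = self.context(neg_idx)                    # (B, K, dim)
        pos = torch.log(torch.sigmoid((t * c).sum(-1)) + 1e-7)
        neg = torch.log(torch.sigmoid(-(n * t.unsqueeze(1)).sum(-1)) + 1e-7).sum(-1)
        return -(pos + neg).mean()


def learn_one(model, optimizer, tokens, window=2, k=5):
    # One incremental update from a single tweet/sentence in the stream.
    ids = [model.index(w) for w in tokens]
    ids = [i for i in ids if i is not None]
    pairs = [(t, c) for i, t in enumerate(ids)
             for c in ids[max(0, i - window):i + window + 1] if c != t]
    if not pairs:
        return
    t_idx = torch.tensor([p[0] for p in pairs])
    c_idx = torch.tensor([p[1] for p in pairs])
    # Uniform negative sampling for brevity (real SGNS uses a unigram^0.75 table).
    neg_idx = torch.randint(0, max(len(model.vocab), 1), (len(pairs), k))
    optimizer.zero_grad()
    model.loss(t_idx, c_idx, neg_idx).backward()
    optimizer.step()


model = IncrementalSGNS()
optimizer = torch.optim.SGD(model.parameters(), lr=0.025)
learn_one(model, optimizer, "new hashtag spreads fast on twitter".split())
```

A streaming pipeline would simply call learn_one once per incoming instance, so the vocabulary and the vectors evolve together with the stream.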

Supplemental Material

MP4 File
RiverText is a Python library designed to address the limitations of traditional word embedding models by providing a comprehensive framework for training and evaluating incremental word embeddings from data streams. These incremental word embeddings allow word representations to be updated dynamically in response to evolving language patterns in sources such as social media and the web. The library implements popular techniques like Skip-gram, Continuous Bag of Words, and Word Context Matrix, using PyTorch as the backend for efficient neural network training. In addition, RiverText includes a module that adapts static intrinsic NLP evaluation tasks to a streaming setting. The open-source library, available at https://dccuchile.github.io/rivertext/, provides detailed documentation and examples to facilitate quick and easy adoption. RiverText is expected to be a valuable resource for researchers and practitioners working with large-scale streaming text data.
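
The streaming adaptation of intrinsic evaluation can be pictured as interleaving incremental training with periodic scoring against a static gold-standard dataset. The sketch below assumes the hypothetical model and learn_one from the previous example and a word-similarity dataset of (word1, word2, human_score) triples, and reports the Spearman correlation every `period` instances; it is a hedged illustration, not the library's actual evaluator module.

```python
# Illustrative sketch only: re-evaluating a streaming model on a static
# word-similarity benchmark (e.g., MEN or WordSim-353 style triples) every
# `period` instances. Assumes the IncrementalSGNS/learn_one sketch above.
import torch
from scipy.stats import spearmanr


def similarity(model, w1, w2):
    i, j = model.vocab.get(w1), model.vocab.get(w2)
    if i is None or j is None:
        return None  # skip out-of-vocabulary pairs
    with torch.no_grad():
        a, b = model.target.weight[i], model.target.weight[j]
        return torch.cosine_similarity(a, b, dim=0).item()


def periodic_similarity_eval(model, optimizer, stream, gold_pairs, period=1000):
    """Interleave incremental training with periodic intrinsic evaluation."""
    results = []
    for n, tokens in enumerate(stream, start=1):
        learn_one(model, optimizer, tokens)
        if n % period == 0:
            preds, golds = [], []
            for w1, w2, score in gold_pairs:
                s = similarity(model, w1, w2)
                if s is not None:
                    preds.append(s)
                    golds.append(score)
            rho = spearmanr(preds, golds).correlation if len(preds) > 1 else float("nan")
            results.append((n, rho))  # correlation trajectory over the stream
    return results
```

The output is a trajectory of correlation scores over time rather than a single number, which is what makes the evaluation meaningful in a streaming setting.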


Cited By

  • (2024) Concept Drift Adaptation in Text Stream Mining Settings: A Systematic Review. ACM Transactions on Intelligent Systems and Technology, 16(2), 1-67. https://doi.org/10.1145/3704922. Online publication date: 21-Nov-2024.
  • (2023) The Impact of Preprocessing Techniques Towards Word Embedding. Advances in Visual Informatics, 421-429. https://doi.org/10.1007/978-981-99-7339-2_35. Online publication date: 15-Nov-2023.


      Information & Contributors

      Information

      Published In

      SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2023
      3567 pages
      ISBN:9781450394086
      DOI:10.1145/3539618

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 July 2023


      Author Tags

      1. data streams
      2. incremental learning
      3. word embeddings

      Qualifiers

      • Research-article

      Funding Sources

      • ANID-Millennium Science Initiative Program
      • ANID CHILE FONDECYT
      • CENIA ANID CENTROS BASALES

      Conference

      SIGIR '23
