Visual exploration and comparison of word embeddings

https://doi.org/10.1016/j.jvlc.2018.08.008

Abstract

Word embeddings are distributed representations of natural language words and have been widely used in many natural language processing tasks. The word embedding space contains local clusters of semantically similar words as well as meaningful directions, such as analogy directions. However, different training algorithms and text corpora both affect the generated word embeddings. In this paper, we propose a visual analytics system to visually explore and compare word embeddings trained with different algorithms and corpora. The word embedding spaces are compared from three aspects, i.e., local clusters, semantic directions and diachronic changes, to understand the similarities and differences between word embeddings.

Introduction

A word embedding is a mathematical representation of a vocabulary item. There are usually two kinds of representations: the one-hot vector representation and the distributed representation. The one-hot vector representation comes to mind first: it represents each word uniquely by its index in the dictionary. However, this representation only distinguishes words and does not express their semantic meanings. The distributed representation is a vector of real numbers, originally proposed by Hinton [1] in 1986, and compared with the one-hot vector representation it can encode semantic information. The Euclidean distance or cosine similarity between distributed vectors can be used to measure the semantic similarity of words. Due to these advantages, distributed representations are widely used in many natural language processing tasks.
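To make this similarity measure concrete, the following sketch computes the cosine similarity between word vectors with NumPy. The toy 4-dimensional vectors are invented for illustration only and do not come from any trained embedding.

    import numpy as np

    def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
        # Cosine of the angle between two word vectors; values near 1 indicate similar directions.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Toy vectors, invented for illustration only.
    v_cat = np.array([0.8, 0.1, 0.3, 0.0])
    v_dog = np.array([0.7, 0.2, 0.4, 0.1])
    v_car = np.array([0.0, 0.9, 0.1, 0.8])

    print(cosine_similarity(v_cat, v_dog))  # about 0.97: semantically related
    print(cosine_similarity(v_cat, v_car))  # about 0.12: unrelated words

The Euclidean distance between the same vectors would rank these pairs in the same way; cosine similarity is simply the more common choice because it ignores vector length.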

To generate distributed representations of words, Xu and Rudnicky [2] applied neural networks to train word embeddings in 2000. The classic training method builds a three-layer neural network model, proposed by Bengio et al. [3], and many subsequent algorithms are based on this work. In 2013, Mikolov et al. [4] proposed the CBOW and Skip-gram methods, which Google released as the open-source word2vec project; this made word embeddings widely accepted by users. Thus, we use word2vec to train the word embeddings in this paper. Since the distributed representation captures the semantic meanings of words, it is very popular in natural language processing and can significantly improve performance in downstream tasks, such as text classification [5], sentiment analysis [6], [7], and semantic analysis [8]. For linguists, word embeddings not only help them understand the structure of language from a macroscopic perspective, but also allow them to analyze the usage and meaning of words in a fine-grained manner.

The raw word embedding data are vectors of tens or even hundreds of dimensions. Such high-dimensional data are difficult to interpret, and the spatial structure of the high-dimensional space is hard to show directly. Although we know that semantically similar words lie close together, it is still hard to imagine how the words are distributed in the high-dimensional space. We therefore need visualization techniques to enhance our understanding of the word embedding space. The space contains semantic information, such as synonym information, which can be interpreted through nearest neighbor words or clustering results. It also contains other semantic information, such as word analogy relationships, and such information requires different visualization techniques to reveal its underlying relationships and structure.
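As a concrete example of these two structures, the sketch below queries nearest neighbors and an analogy direction with gensim's KeyedVectors API. The file name is a placeholder and the queries are only illustrative; they are not the interface of our system.

    from gensim.models import KeyedVectors

    # Hypothetical path; assumes embeddings were saved in word2vec text format.
    wv = KeyedVectors.load_word2vec_format("wiki2017_cbow_200d.txt")

    # Local cluster / synonym structure: nearest neighbors by cosine similarity.
    print(wv.most_similar("nanny", topn=10))

    # Analogy direction: king - man + woman should land near queen.
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))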

Different corpora, parameter settings and training methods may lead to different word embedding spaces, and these differences are difficult to evaluate mathematically. The data used to train word embeddings are large text corpora, usually gigabytes in size. In this paper, we use the CBOW and Skip-gram methods, whose model architectures are quite different, as discussed in Section 3. We therefore design a visual analytics system to explore and compare the similarities and differences between two word embedding spaces trained from different corpora or methods. The contributions of this paper are:

  • We propose an interactive visual analytics system to understand and compare word embedding spaces, in order to obtain an intuitive understanding of the differences between the spaces.

  • A case study demonstrating insights into word embeddings trained with different algorithms and corpora, which reveals interesting results such as latent semantic changes of words.

Section snippets

Related work

Word embeddings encode each word as a point in a high-dimensional space. To some extent, the spatial information corresponds to the semantic information of the word. There are various training methods for word embeddings, and most of them are based on the statistical language model proposed by Bengio et al. [3]. The most widely used algorithms are Google’s word2vec [4] and GloVe [9]. Lai et al. [10] surveyed many word embedding training methods and provided a number of evaluation criteria. Many methods

Background

Word embeddings have evolved since the concept emerged. The one-hot representation uses indexes to represent words, but it conveys little semantic information. To enrich the meanings of word embeddings, researchers use word frequencies and context information to encode semantic meaning, which led to the distributed representations of words. With the popularity of neural network techniques, Mikolov et al. [4] proposed CBOW and Skip-gram to generate word

Tasks

Word embeddings are an essential part of many natural language processing tasks, and their quality strongly affects the performance of these tasks. However, evaluating word embeddings is challenging and there are no quantitative baseline criteria. In order to explore and compare the semantic properties of high-dimensional word embedding spaces, we define four tasks after reviewing the literature on word embeddings in the natural language processing field.

  • Task 1:

Visual design

In this section, we introduce our method. First, we define the data format and the alignment algorithm. Then we describe our visual analytics views.

Data preparation

Our corpora are the 2017 English Wikipedia dump, the 1987 New York Times corpus from the Linguistic Data Consortium (LDC), and Yelp reviews. We trained word embeddings on these corpora with word2vec in the gensim package. For the training parameters, we set the window size to five and the minimum word count to ten, and generated 200-dimensional word embeddings. For comparison, we trained each corpus with both the CBOW and Skip-gram algorithms using the same parameters.
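A minimal sketch of this training setup with gensim is given below, assuming gensim 4.x, where the dimensionality parameter is named vector_size (older releases call it size). The corpus iterator and file names are placeholders rather than our actual preprocessing pipeline.

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Placeholder corpus file: one pre-tokenized sentence per line.
    corpus = LineSentence("wiki_en_2017_tokenized.txt")

    common = dict(vector_size=200, window=5, min_count=10, workers=4)
    cbow_model = Word2Vec(corpus, sg=0, **common)      # sg=0 selects CBOW
    skipgram_model = Word2Vec(corpus, sg=1, **common)  # sg=1 selects Skip-gram

    # Save both embedding sets so they can be loaded and compared later.
    cbow_model.wv.save_word2vec_format("wiki_cbow_200d.txt")
    skipgram_model.wv.save_word2vec_format("wiki_sg_200d.txt")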

Difference caused by training algorithms

We

Conclusion and future work

In this paper, we designed an interactive visual analytics system, including the clustering view, the semantically similar word view and the analogy view, to visually understand, explore and compare word embeddings. We compare three different aspects of word embeddings and find some interesting observations. In the cluster view, we find that the word nanny carries the meaning of babysitter in most cases rather than the labeled meaning of female goat. The analogy view shows the analogy pair striking:struck

References (37)

  • Y. Yang et al., Vistopic: a visual analytics system for making sense of large document collections using hierarchical topic modeling, Vis. Inform. (2017)
  • C. Li et al., Metro-wordle: an interactive visualization for urban text distributions based on wordle, Vis. Inform. (2018)
  • G. E. Hinton, Learning distributed representations of concepts, in: Proceedings of the Eighth Annual Conference of the...
  • W. Xu, A. I. Rudnicky, Can artificial neural networks learn language models?, in: Proceedings of the Sixth...
  • Y. Bengio et al., A neural probabilistic language model, J. Mach. Learn. Res. (2003)
  • T. Mikolov et al., Efficient estimation of word representations in vector space, in: International Conference on Learning Representations (2013)
  • M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in: Proceedings of the...
  • J. Xu, Y. Tao, H. Lin, R. Zhu, Y. Yan, Exploring controversy via sentiment divergences of aspects in reviews, in:...
  • J. Xu et al., Vaut: a visual analytics system of spatiotemporal urban topics in reviews, J. Vis. (2018)
  • R. Socher, J. Bauer, C. D. Manning, et al., Parsing with compositional vector grammars, in: Proceedings of the...
  • J. Pennington, R. Socher, C. Manning, GloVe: global vectors for word representation, in: Proceedings of the Conference...
  • S. Lai et al., How to generate a good word embedding, IEEE Intell. Syst. (2016)
  • O. Levy et al., Improving distributional similarity with lessons learned from word embeddings, Trans. Assoc. Comput. Linguist. (2015)
  • J. Mu, S. Bhat, P. Viswanath, All-but-the-top: simple and effective postprocessing for word representations. (2017)...
  • A. Gittens, D. Achlioptas, M. W. Mahoney, Skip-gram-Zipf + uniform = vector additivity, in: Proceedings of the Fifty-Fifth...
  • A. Globerson et al., Sufficient dimensionality reduction, J. Mach. Learn. Res. (2003)
  • X. Rong, E. Adar, Visual tools for debugging neural language models, in: Proceedings of the ICML Workshop on...
  • S. Arora, Y. Liang, T. Ma, A simple but tough-to-beat baseline for sentence embeddings, in: Proceedings of the...