
1 Introduction

Using neural networks for word embedding learning was first proposed in the NNLM (Neural Network Language Model) [1]. Since then, word embedding methods have proven to give excellent results in semantic analysis tasks [13]. It is evident that embeddings capture not only statistical information but syntactic information as well. Complex linguistic problems, such as exploring the semantic change of words across discrete time periods [8], can thus be tackled properly. Moreover, embedding methods have been used to detect large-scale linguistic change-points [9] as well as to seek out regularities in language change, such as related words tending to undergo parallel change over time [15].

Word2vec is a typical low-dimensional (usually 50 to 300 dimensions) word embedding method based on neural networks, while SVD (Singular Value Decomposition) is a typical low-dimensional word representation method based on matrix decomposition. These methods have achieved state-of-the-art results in many diachronic analysis tasks [4, 5, 10] and diachronic-based applications [3, 16]. However, due to inconsistent contexts and the stochastic nature of the above embedding methods, words from different periods cannot be easily encoded into the same vector space [6]. Also, due to the absence of a Chinese historical corpus [6], diachronic analysis of Chinese has not been well studied. Previous work normally encodes words from different time periods into separate vector spaces first, and then aligns the learned low-dimensional embeddings [9]. The inherent uncertainty of embeddings leads to high variability of a given word's closest neighbors [7]. Alignment only approximates the relationship among the vector spaces and forcibly consolidates inconsistent contexts into a common one, which further increases the variability of neighboring words. Moreover, as word-context pairs fluctuate over time, this forcible consolidation may distort word semantics undesirably.

The problem of alignment can be avoided by not encoding words into a low-dimensional space [8]. Among such approaches, Positive Point-wise Mutual Information (PPMI) [14] outperforms a wide variety of other high-dimensional methods [2]. PPMI naturally aligns word vectors by constructing a high-dimensional sparse matrix in which each row represents a unique word and each column corresponds to a context. Unfortunately, compared with low-dimensional methods, PPMI introduces additional problems. In particular, building the PPMI matrix consumes a lot of computing resources in a high-dimensional sparse setting. Though PPMI wards off alignment issues, it does not enjoy the advantages of low-dimensional embeddings such as higher efficiency and better generalization.

In this paper, we first introduce three popular word representation methods: PPMI, PPMI-based SVD and word2vec. Then we describe the experimental setup, including the data, the data preprocessing and the evaluation metrics. Finally, we discuss the experimental results, the application of diachronic word analysis to Chinese, and future work.

2 Related Work

Many contributions to word representation have been made by other researchers. In this paper, we mainly use three word representation methods. We choose PPMI as our sparse word representation method, which uses word co-occurrence information as the vector. We choose PPMI-based SVD and word2vec as our low-dimensional word representation methods. PPMI-based SVD takes advantage of the co-occurrence and frequency information in the PPMI matrix. Word2vec is a neural-network-based word representation method that takes advantage of the nonlinear expressive ability of neural networks. We introduce these three methods in detail in the following sections.

2.1 PPMI

In PPMI, words are represented by constructing a high-dimensional sparse matrix \(M \in R^{|V_w| \times |V_c|}\), where each row denotes a word w and each column represents a context c. The value of the matrix cell \(M_{ij}\) is the PPMI value, which measures the association between the word \(w_i\) and the context \(c_j\) and is obtained by:

$$\begin{aligned} M_{ij}&= \max \left\{ \log \left( \frac{p(w_i,c_j)}{p(w_i)p(c_j)}\right) , 0 \right\} \end{aligned}$$
(1)
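For illustration, the following is a minimal Python sketch of how such a PPMI matrix could be built from co-occurrence counts (the window size and the small tokenized corpus are assumptions made for the example; in practice the matrix would be stored sparsely):

```python
# Minimal PPMI sketch (illustrative; window size and corpus are assumed).
import numpy as np

def build_ppmi(sentences, window=2):
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if i != j:
                    counts[idx[w], idx[s[j]]] += 1
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total    # p(w)
    pc = counts.sum(axis=0, keepdims=True) / total    # p(c)
    pwc = counts / total                              # p(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(pwc / (pw * pc))
    ppmi = np.maximum(pmi, 0.0)                       # Eq. (1)
    ppmi[~np.isfinite(ppmi)] = 0.0                    # zero counts -> 0
    return ppmi, vocab
```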

2.2 PPMI-Based SVD

SVD embeddings implement dimensionality reduction over a sparse high-dimensional matrix S, in which each row represents a word and each column corresponds to a potential feature of the word. More concretely, SVD decomposes the sparse matrix S into the product of three matrices, \(S = U \cdot V \cdot T\), where both U and T are orthonormal and V is a diagonal matrix of singular values in descending order. The top singular values in V retain most of the information about the words; by keeping only the top d singular values, we obtain \(S_d = U_d \cdot V_d \cdot T_d\), which approximates S. The word embedding matrix \(\overrightarrow{W}\) is then approximated by:

$$\begin{aligned} \overrightarrow{W} \approx U_d \end{aligned}$$
(2)
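A sketch of this truncation using SciPy's sparse SVD is shown below (the dimension d is an assumed hyperparameter, and some works additionally scale \(U_d\) by the singular values):

```python
# Truncated SVD over a (sparse) PPMI matrix; d is an assumed dimension.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def svd_embeddings(ppmi, d=100):
    # svds returns the d largest singular values, in ascending order;
    # d must be smaller than both matrix dimensions.
    U, S, Vt = svds(csr_matrix(ppmi), k=d)
    order = np.argsort(-S)          # reorder to descending singular values
    U, S = U[:, order], S[order]
    return U                        # rows approximate the word embeddings (Eq. 2)
```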

2.3 Word2vec

Bengio et al. [1] noted that the neural language model could be modified in other ways to reduce the number of training parameters, for example by using recurrent neural networks. In order to obtain a good set of word vectors quickly, Mikolov et al. [12] developed a word vector training tool named word2vec\(^{1}\). It includes two models: the CBOW (Continuous Bag-of-Words) model and the Skip-gram model.

Both models are feed-forward neural network models without the nonlinear hidden layer of the neural language model in [1]. This modification greatly simplifies the model structure and reduces training time.

In this paper, we use the Skip-gram model as the word2vec training model. Given a center word \(w_t\) as prior knowledge, the model predicts its surrounding context words; training the Skip-gram model amounts to maximizing the average log probability in Eq. (3):

$$\begin{aligned} \frac{1}{T}\sum \nolimits _{t=1}^{T}\sum \nolimits _{-c \leqslant j \leqslant c,j \ne 0} \log p(w_{t+j}|w_t) \end{aligned}$$
(3)

where T denotes the number of word tokens in the training corpus, and c represents the number of context words before or after the middle word. \(p(\cdot )\) is a softmax function, shown as Eq. (4):

$$\begin{aligned} p(w_{t+j}|w_t) = \frac{\exp \left( {v'_{w_{t+j}}}^{\top } v_{w_t}\right) }{\sum \nolimits _{w=1}^{|V|} \exp \left( {v'_{w}}^{\top } v_{w_t}\right) } \end{aligned}$$
(4)

where \(v_w\) and \(v_w'\) are the vector representations of word w in the input layer and the output layer, respectively, and |V| is the vocabulary size.

In this paper, we use the Skip-gram model with negative sampling from word2vec (referred to as SGNS), which is a fast and effective way to build word representations.
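As a sketch, SGNS can be trained with the gensim library roughly as follows (the gensim 4.x API is assumed, and the hyperparameters and toy corpus are illustrative rather than the exact settings used in our experiments):

```python
# SGNS training sketch with gensim (4.x API assumed); settings are illustrative.
from gensim.models import Word2Vec

sentences = [["今天", "天气", "很", "好"], ["苹果", "发布", "新", "手机"]]  # segmented corpus
model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window c
    sg=1,              # use the Skip-gram model
    negative=5,        # negative sampling
    min_count=1,
)
vec = model.wv["苹果"]   # vector for the word "apple"
```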

2.4 Detect Diachronic Changes

We detect the semantic change-point of a word by measuring the semantic similarity of its two profiles in the respective time periods. The measure is as follows:

$$\begin{aligned} sim(w \mid x1, x2) = \cos \left( \overrightarrow{W^{x1}}, \overrightarrow{W^{x2}}\right) \end{aligned}$$
(5)

where \(\cos (\cdot ,\cdot )\) is the cosine similarity of two vectors, which lies in \([-1, 1]\) (and in \([0, 1]\) for non-negative PPMI vectors). \(\overrightarrow{W^{x1}}\) and \(\overrightarrow{W^{x2}}\) are the vectors of the same word w learned in periods x1 and x2, respectively. A smaller value indicates a more significant difference, that is, the meaning of the word shifts more over time. We measure the semantic difference of a specific word between every pair of periods.
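A minimal sketch of Eq. (5), assuming the two period-specific vectors of a word have already been looked up:

```python
# Cosine similarity between two period-specific vectors of the same word (Eq. 5).
import numpy as np

def diachronic_similarity(v1, v2):
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom > 0 else 0.0

# e.g. sim = diachronic_similarity(emb_1998_2002["苹果"], emb_2008_2012["苹果"])
```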

2.5 Alignment Method

In order to compare word vectors from different time periods, we must align the vectors across periods. PPMI is naturally aligned, but PPMI-based SVD and word2vec are not: these two methods may produce embeddings that differ by arbitrary orthogonal transformations. Following [6], we use orthogonal Procrustes to align the learned word representations. Let \(W^{(t)} \in \mathbb {R}^{|V| \times d}\) be the matrix of word embeddings learned for period t, with one row per word. We align the embeddings by optimizing:

$$\begin{aligned} R^{(t)} = \mathop {\arg \min }_{Q^{\top }Q=I} \Vert W^{(t)}Q - W^{(t+1)} \Vert _F \end{aligned}$$
(6)
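Equation (6) has a closed-form solution via SVD; a sketch using SciPy's orthogonal Procrustes routine is given below (it assumes the two embedding matrices are restricted to a shared vocabulary and row-aligned):

```python
# Orthogonal Procrustes alignment sketch (Eq. 6); rows of W_t and W_t1 are the
# embeddings of the same vocabulary in two adjacent periods, in the same order.
from scipy.linalg import orthogonal_procrustes

def align(W_t, W_t1):
    Q, _ = orthogonal_procrustes(W_t, W_t1)   # argmin ||W_t Q - W_t1||_F, Q^T Q = I
    return W_t @ Q                            # W_t rotated into the space of W_t1
```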

3 Experimental Results

3.1 Preprocessing

Dataset. In this paper, we use a large set of web pages crawled by a search engine, provided by Sogou Lab [11]. The data is provided as raw XML-style labeled text with no time tag in the labels. After analyzing the data, we found that the URL of a document may contain time information (this is common on news sites), so in this experiment we use the time information provided by the URL.

To extract the time information from the documents, we use a regular expression to match date patterns such as "YYYY-MM-DD" or "YYYY_MM_DD" in the URL; URLs without such a pattern are discarded. We then apply a filter that removes all HTML tags and other useless symbols to build the final training corpus.

The word segmentation tool jieba\(^{2}\) is applied for Chinese word segmentation (including compound words).
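A sketch of these preprocessing steps is shown below (the example URL is hypothetical; the regular expression mirrors the date patterns described above):

```python
# Sketch of the preprocessing described above: extract a date from the URL,
# then segment the page text with jieba. The example URL is hypothetical.
import re
import jieba

DATE_RE = re.compile(r"(\d{4})[-_](\d{2})[-_](\d{2})")

def extract_year(url):
    m = DATE_RE.search(url)
    return int(m.group(1)) if m else None   # documents without a date are dropped

def segment(text):
    return [w for w in jieba.cut(text) if w.strip()]

year = extract_year("http://news.example.com/2008_05_12/article.html")  # -> 2008
tokens = segment("苹果发布了新手机")
```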

After these preprocessing steps, we finally obtain 52,324,791 lines of data consisting of 467,826,233 word tokens (225,182 unique words).

3.2 Number of Discovered Shifted Words

We detect linguistic change by searching for semantically shifted words between time periods and counting how many such words are clearly identified in a given time slot. Two corpora, “Corpus 1998-2002” and “Corpus 2008-2012”, are chosen because these two 5-year ranges may exhibit the semantic shifts accompanying the development of the Internet in China. For each word, we compare the semantic similarity of its word vectors between the two periods. Moreover, in order to build a comprehensive picture of the linguistic change, we identify shifted words with five different thresholds (semantic similarity thresholds of 0.1, 0.2, 0.3, 0.4 and 0.5, respectively). The results are shown in Table 1. PPMI detects the largest number of shifted words under the same conditions. From Table 1 we can also see that the detected words of PPMI, SVD and SGNS are mainly distributed under the 0.1, 0.4 and 0.5 thresholds, respectively. This shows the interesting fact that the results of the three methods follow different distributions.

Table 1. Number of shifted words
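A sketch of how such counts could be obtained, assuming each method provides a dictionary mapping words to vectors for each period:

```python
# Count words whose cross-period similarity falls below each threshold (as in Table 1).
import numpy as np

def count_shifted(emb_a, emb_b, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)):
    shared = set(emb_a) & set(emb_b)          # words present in both periods
    sims = {}
    for w in shared:
        v1, v2 = emb_a[w], emb_b[w]
        denom = np.linalg.norm(v1) * np.linalg.norm(v2)
        sims[w] = float(v1 @ v2 / denom) if denom > 0 else 0.0
    return {t: sum(1 for s in sims.values() if s < t) for t in thresholds}
```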

3.3 Visualizing Semantic Evolution of Words

We select some words with detected semantic change in the corpus and visualize them in three periods (1998–2002, 2003–2007 and 2008–2012). To draw the visualization for a given semantically changed word, we extract its semantically similar neighbors in each period and place them around the word at distances determined by the vector distance between the word and each neighbor. The visualizations of the semantically changed words are shown in Figs. 1 and 2. In Fig. 1, we can see that the meaning of “copy” in Chinese changes at about 2008–2012, moving towards the sense of “arena” in English. In Fig. 2, we can see that the meaning of “apple” in Chinese changes at about 2003–2007, moving towards “Apple Inc.” in English.

Fig. 1. Visualization of “copy” in Chinese

Fig. 2. Visualization of “apple” in Chinese
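A sketch of how the per-period neighbor lists behind Figs. 1 and 2 could be extracted, assuming one gensim KeyedVectors-style model per period (the variable names are hypothetical):

```python
# Sketch: nearest neighbors of a target word in each period, used for the
# visualization described above (gensim KeyedVectors assumed; names hypothetical).
def neighbors_per_period(models, word, topn=10):
    # models: {"1998-2002": kv1, "2003-2007": kv2, "2008-2012": kv3}
    result = {}
    for period, kv in models.items():
        if word in kv:
            result[period] = kv.most_similar(word, topn=topn)  # [(neighbor, cosine), ...]
    return result
```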

3.4 Discussions

Here we discuss three problems arising from our diachronic analysis results:

  • How to find representations for semantically active words, given that they may change in meaning over time.

  • How to quantitatively judge whether a historical word embedding method is effective.

  • How to find the relationship between diachronically shifted words, linguistic evolution and social changes.

As discussed in Sect. 1, words from different time periods cannot be easily projected into the same dense vector space. Suppose they were directly represented in a single dense vector space: how would we trace their semantic change? Even though a word has a different profile in each corpus, all of these profiles would be represented by one unique vector, which remains unchanged across time periods and insensitive to the word's varying contexts. In addition, word embeddings learn a single global representation per word, which also makes polysemous words unstable. This problem therefore needs further study.

In this paper, we use different word-distance thresholds to find the shifted words. In the absence of ground-truth data about known word shifts, this can serve as a rough quantitative way to evaluate word embedding methods, but it is simple and inaccurate. A possible direction is to apply density analysis to the word embedding results.

Also, although our work can identify diachronically shifted words, how to establish the relationship between shifted words, linguistic evolution and social changes is the next question. One possible direction is to design a global distance evaluation algorithm to capture linguistic evolution; another is to use specific social-change events to analyze their relationship with shifted words.

4 Conclusion and Future Work

Though low-dimensional methods enjoy higher efficiency and better generalization, high-dimensional word representation methods still have their own advantages. In this paper, we use both high-dimensional and low-dimensional word representation methods to build word vectors across time periods. As current research has seldom addressed Chinese corpus analysis, we use a large amount of data from web pages crawled by the Sogou search engine to perform this analysis. After training the three word representation methods on this Chinese corpus, we use different thresholds to identify shifted words from the representations and visualize the semantic evolution of words.

In the future, we will mainly focus on the research directions discussed above. The next research topic is to reveal the relationship between word meaning shifts, linguistic evolution and social changes in Chinese corpora. We will also try to build Chinese word-shift ground-truth data in order to evaluate word shifts quantitatively.