
1 Introduction

In social networks, people often create morphs, a special type of fake alternative name used to avoid internet censorship or for other purposes [9]. Creating morphs is very popular in Chinese social networks such as Sina Weibo. Figure 1 shows a Sina Weibo tweet in which the morph “Little Leo” was created to refer to “Leonardo Wilhelm DiCaprio”. The term “Leonardo Wilhelm DiCaprio” is called this morph’s target entity.

Fig. 1. An example of morph use in Sina Weibo.

Morph resolution is very important for Natural Language Processing (NLP) tasks. In NLP, the first step is to obtain the true meanings of words, including morphs. Thus, successful morph resolution is the foundation of many NLP tasks, such as word segmentation, text classification, text clustering, and machine translation.

Many approaches have been proposed for morph resolution. Huang et al. [3] can be considered to have conducted the first study on this problem, but their method needs a large amount of human-annotated data. Zhang et al. [16] proposed an end-to-end context-aware morph resolution system. Sha et al. [9] proposed a framework based on character-word embeddings and radical-character-word embeddings to explore the semantic links between morphs and target entities. These methods do not use the context information of morphs and target entities effectively: they only use the context of the neighboring words of morphs and target entities, and some of these neighboring words are unrelated to the semantic links between morphs and target entities. There is therefore still room to improve the accuracy of morph resolution.

In this paper, we propose a framework based on autoencoders combined with effective context information of morphs and target entities. First, we analyze what context information is useful for morph resolution and design a context information filter that extracts effective context information using pointwise mutual information. Second, we propose a variant of autoencoders that combines the semantic vectors of morphs or target candidates with their effective context information, and we use the combined vectors as the semantic representations of morphs and target candidates. Finally, we rank target candidates by measuring the similarity between the semantic representations of morphs and target candidates. In this way, we consider only the effective context information of morphs and target entities, and we use autoencoders to extract their essential semantic characteristics. Experimental results show that our approach outperforms the state-of-the-art method.

Our paper offers the following contributions:

  1.

    We propose a new framework based on autoencoders combined with the effective context information of morphs and target entities. Our approach outperforms the state-of-the-art method.

  2.

    To extract the effective context information of morphs and target entities, we leverage pointwise mutual information between terms. This yields more accurate semantic representations of terms and improves the accuracy of morph resolution.

  3.

    We propose a variant of autoencoders to generate semantic representations of terms. These autoencoders combine morphs or target entities with their effective context information and extract the essential semantic characteristics of morphs and target entities.

2 Related Work

The study of morphs first appeared in work on normalizing informal texts written in internet language. For example, Wong et al. [14] examine the phenomenon of word substitution based on phonetic patterns in Chinese chat language, such as replacing “” (me, pronounced ‘wo’) with “” (pronounced ‘ou’), which is similar to morphs. Early normalization work on informal text mainly uses rule-based approaches [10, 14, 15]. Later approaches combine statistical learning with rules for the normalization task [2, 5, 12, 13]. Wang et al. [13] establish a probabilistic model based on typical features of informal texts, including phonetics, abbreviations, replacements, etc., and train it through supervised learning on a large corpus.

The concept of a morph first appeared in the study of Huang et al. [3], who analyze the basic features of morphs, including surface, semantic, and social features, and design a simple classification model for morph resolution based on these features. Zhang et al. [16] propose an end-to-end system covering both morph identification and morph resolution; they also summarize eight patterns for generating morphs and study how to generate new morphs automatically from these patterns. Sha et al. [9] propose a framework based on character-word embeddings and radical-character-word embeddings to resolve morphs after analyzing the common characteristics of morphs and target entities in cross-source corpora.

Autoencoders are neural networks capable of learning efficient representations of input data without any supervision [11]; they can act as powerful feature detectors, and many variants have been proposed. Context-sensitive autoencoders [1] integrate context information into autoencoders and obtain a joint encoding of the input data. In this paper, we adopt a similar model to obtain the semantic representations of morphs and target candidates. Since the autoencoder is an unsupervised algorithm, we do not need to prepare much annotated data.

In this paper, aiming to make full use of the effective context information of morphs and target entities, we propose a new framework based on autoencoders combined with the extracted effective context information. Compared with current methods, our approach incorporates only the effective context information of related words and outperforms the state-of-the-art methods.

3 Problem Formulation

Morph resolution: Given a set of morphs, our goal is to produce, for each morph, a list of target candidates ranked by the probability of being its real target entity.

Given a document set \(D=\{d_1, d_2, \ldots , d_{|D|} \}\) and a morph set \(M=\{m_1, m_2, \ldots , m_{|M|} \}\), where each morph \(m_i \in M\) and its real target entity both appear in D, our task is to discover a list of target candidates from D for each \(m_i\) and rank the candidates by the probability of being the real target entity.

As shown in Fig. 1, the morph “Little Leo” was created to refer to “Leonardo Wilhelm DiCaprio”. Given the morph “Little Leo” and a set of tweets from Sina Weibo, our goal is to discover a list of target candidates from the tweets and rank them by the probability of being the real target entity. The term “Leonardo Wilhelm DiCaprio” is expected to be the first result (the real target entity) in the ranked candidate list.

4 Resolving Morphs Based on Autoencoders Combined with Effective Context Information

We designed a framework based on autoencoders combined with effective context information to solve this problem. The procedure, shown in Fig. 2, mainly consists of the following steps:

Fig. 2. The procedure of morph resolution.

  1.

    Preprocessing

    In this step, we filter out unrelated terms and extract the target candidates \(E_{m_i}=\{e_1, e_2, \ldots , e_{|E_{m_i}|} \}\) in two steps: (1) tweets that contain morphs are retrieved, and the publishing times of these tweets are taken as the morphs’ appearing times. Sha et al. discovered that morphs and target entities are highly consistent in temporal distribution [9], so we set a time slot of 4 days to collect tweets that may contain target candidates of morphs; (2) since most morphs refer to named entities, such as the names of persons, organizations, and locations, we only need to focus on the named entities in these tweets to find target candidates. Many off-the-shelf tools for POS (part-of-speech) tagging and NER (named entity recognition) can be used, including NLPIR [17], Stanford NER [6], and so on (see the candidate-extraction sketch after this list).

  2.

    Extracting effective context information

    We leverage effective context information (ECI) to generate the semantic representations of morphs and target candidates. The effective context information consists of the contextual terms whose semantic relationship with the target term is closer than that of other terms; it can effectively distinguish the characteristics of morphs and their target entities from other terms.

  3.

    Autoencoders combined with effective context information

    We use deep autoencoders (dAE) to obtain a joint encoding of morphs or target candidates and their effective context information. Autoencoders can fuse different types of features and embed them into a single encoding, which is much more flexible than traditional word embedding methods.

  4.

    Ranking target candidates

    After creating the encoding representations of morphs and target candidates, we rank each target candidate \(e_j\) by the cosine similarity between the encodings of the morph and the candidate. The larger the cosine similarity between the morph and a target candidate, the more likely the candidate is the real target entity of the morph. The ranked target candidate sequence \(\hat{T}_{m_i}\) is the result of morph resolution.
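As a rough illustration of the preprocessing step (step 1 above), the following Python sketch extracts candidate named entities from the time-filtered tweets. We use jieba's part-of-speech tagger here merely as a stand-in for the NLPIR or Stanford NER tools mentioned above; the function name and the tag set are our own choices.

```python
import jieba.posseg as pseg

# POS tags treated as named entities (ICTCLAS convention used by jieba):
# "nr" = person, "ns" = location, "nt" = organization.
ENTITY_TAGS = {"nr", "ns", "nt"}

def candidate_entities(tweets):
    """Collect named entities from already time-filtered tweets as target candidates."""
    candidates = set()
    for text in tweets:
        for word, flag in pseg.cut(text):
            if flag in ENTITY_TAGS:
                candidates.add(word)
    return candidates
```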

In the following sections, we will focus on these two steps: “extracting effective context information” and “autoencoders combined with effective context information”.

4.1 Extracting Effective Context Information

To extract the effective context information, we use pointwise mutual information (PMI) to select the terms that are related to morphs or target entities. PMI is easy to calculate:

$$\begin{aligned} PMI(x;y) = \log \frac{p(x,y)}{p(x)p(y)} \end{aligned}$$
(1)

where p(x) and p(y) refer to the probabilities of occurrence of terms x and y in the corpus, and p(x, y) is their probability of co-occurrence. The ratio \(\frac{p(x,y)}{p(x)p(y)}\) measures how much more often the two terms co-occur than they would if they were independent.

PMI quantifies the discrepancy between the probability of the two terms’ coincidence under their joint distribution and under their individual distributions assuming independence. PMI is zero when x and y are independent (i.e. \(p(x, y) = p(x)p(y)\)) and is maximized when they are perfectly associated (i.e. \(p(x, y) = p(x) = p(y)\)). We use PMI to find collocations and associations between words. Good collocation pairs have high PMI because their probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word. Conversely, a pair of words whose individual probabilities of occurrence are considerably higher than their probability of co-occurrence gets a small PMI score.

Given a word w, we collect all contextual terms of w from the preprocessed tweets that contain w into the set \(C_w\). Note that we also remove auxiliary words and prepositions, since they are useless for the following method. Next, for each term \(c_i \in C_w\), we calculate the PMI between w and \(c_i\) and take the terms with the top-k PMI as the effective context information of w. In the same way, we obtain the effective contextual term sets of all morphs and target candidates.
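A minimal sketch of this PMI context filter is shown below. It assumes the tweets have already been segmented into token lists with auxiliary words and prepositions removed, estimates p(x), p(y), and p(x, y) from document-level counts, and uses our own helper names.

```python
import math
from collections import Counter

def doc_counts(tweets):
    """Document-level term and term-pair counts over tokenized tweets."""
    term_count, pair_count = Counter(), Counter()
    for tokens in tweets:
        uniq = set(tokens)
        term_count.update(uniq)
        for x in uniq:
            for y in uniq:
                if x < y:
                    pair_count[(x, y)] += 1
    return term_count, pair_count, len(tweets)

def pmi(x, y, term_count, pair_count, n):
    """PMI(x; y) = log p(x, y) / (p(x) p(y)), estimated from document counts."""
    pair = pair_count[(x, y)] if x < y else pair_count[(y, x)]
    if pair == 0:
        return float("-inf")
    return math.log((pair / n) / ((term_count[x] / n) * (term_count[y] / n)))

def effective_context(w, tweets, window=20, k=100):
    """Top-k contextual terms of w by PMI (the PMI context filter)."""
    term_count, pair_count, n = doc_counts(tweets)
    candidates = set()
    for tokens in tweets:
        for i, t in enumerate(tokens):
            if t == w:  # collect terms within the context window around w
                candidates.update(tokens[max(0, i - window): i + window + 1])
    candidates.discard(w)
    ranked = sorted(candidates,
                    key=lambda c: pmi(w, c, term_count, pair_count, n),
                    reverse=True)
    return ranked[:k]
```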

Table 1. Contextual terms of top-5 PMI of morphs, target entities, and non-target entities.

Table 1 shows the contextual terms with top-5 PMI for a morph, its target entity, and two non-target entities; we regard these contextual terms as effective context information. Each row shows the effective contextual terms of a different word. The first and second rows show the effective contextual terms of the morph “The Flash” and its target entity “Wade”; the third and fourth rows show those of the non-target entities “Yao” and “Beckham”. We can see that the effective contextual terms of the morph and its target entity are nearly identical, while the effective contextual terms of the morph are completely different from those of the non-target entities.

This means that the effective contextual terms can distinguish the target entity from non-target entities: they have high PMI with the morph “The Flash” and its target entity “Wade”, but low PMI with non-target entities such as “Yao” and “Beckham”.

These results show that we can extract effective contextual term sets using PMI. Since morphs and their target entities should have similar context information, PMI extracts similar contextual term sets for them, and through these contextual terms we can capture more accurate semantic links between morphs and target entities.

4.2 Incorporating Effective Context Information into Autoencoders

In this section, we obtain representations of the essential characteristics of morphs and target candidates by incorporating effective contextual terms into autoencoders.

An autoencoder is an unsupervised neural network that encodes its input x into a hidden representation h and then reconstructs x from h as precisely as possible:

$$\begin{aligned} h = g(Wx + b) \end{aligned}$$
(2)
$$\begin{aligned} \hat{x} = g(W'h+b') \end{aligned}$$
(3)

\(\hat{x}\) is the reconstruction of x. \(W \in R^{d' \times d}\), \(b \in R^{d'}\), \(W' \in R^{d \times d'}\), and \(b' \in R^{d}\) are the parameters learned during training, where d and \(d'\) denote the dimensions of the vectors before and after encoding, respectively. Usually \(d' \le d\) for dimensionality reduction. The function g is the activation function of the neural network. Figure 3(a) shows the structure of a basic single-layer autoencoder, which can serve as a cell in deep autoencoders (dAE).
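For concreteness, the following PyTorch sketch implements such a basic cell (Eqs. 2 and 3); the class name and the choice of a sigmoid activation are our own assumptions.

```python
import torch
import torch.nn as nn

class AutoencoderCell(nn.Module):
    """A basic autoencoder cell: h = g(Wx + b), x_hat = g(W'h + b') (Eqs. 2-3)."""

    def __init__(self, d: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d, d_hidden)  # W, b
        self.decoder = nn.Linear(d_hidden, d)  # W', b'
        self.g = nn.Sigmoid()                  # activation g (assumed)

    def forward(self, x):
        h = self.g(self.encoder(x))        # Eq. 2
        x_hat = self.g(self.decoder(h))    # Eq. 3
        return h, x_hat
```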

Fig. 3. (a) A basic autoencoder cell; (b) autoencoders combined with effective context information using the PMI filter.

To incorporate effective context information into the autoencoders, we extend their inputs. As shown in Fig. 3(b), besides the term w itself, we also input the effective context information of w. First, we extract \(C_w^f\), the effective context information of w, using the method in Sect. 4.1. Second, we generate the word embedding of each term in \(C_w^f\) and set \(cc_x\) to the average of these word embeddings; many word embedding methods can be used, such as word2vec [7] or GloVe [8]. Third, we generate \(u_{cc}\), the hidden encoding of the effective context information, by feeding \(cc_x\) into an autoencoder. Finally, we incorporate \(u_{cc}\) into the deep autoencoders to generate the joint encoding of the input term and its effective context information.
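The context input \(cc_x\) can be prepared roughly as follows; in this sketch, `effective_context` is the PMI helper sketched in Sect. 4.1, and `w2v` is assumed to be a trained gensim word2vec `KeyedVectors` model with 100-dimensional vectors.

```python
import numpy as np

def context_vector(w, tweets, w2v, window=20, k=100):
    """cc_x: the average word2vec embedding of w's effective context terms."""
    terms = effective_context(w, tweets, window=window, k=k)  # PMI filter (Sect. 4.1)
    vecs = [w2v[t] for t in terms if t in w2v]                # skip out-of-vocabulary terms
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.mean(vecs, axis=0)
```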

For the \(k\)-th layer of the deep autoencoders, the encoder turns \(h_{k-1}\) (with \(h_0 = x\)) and \(u_{cc}\) into one hidden representation as follows:

$$\begin{aligned} h = g(W_k h_{k-1} + U_k u_{cc} + b_k) \end{aligned}$$
(4)
$$\begin{aligned} \hat{h}_{k-1} = g(W'_k h + b_k') \end{aligned}$$
(5)
$$\begin{aligned} \hat{u}_{cc} = g(U_k' h + b_{k}'') \end{aligned}$$
(6)

where \(\hat{h}_{k-1}\) and \(\hat{u}_{cc}\) are the reconstructions of \(h_{k-1}\) and \(u_{cc}\). Equation 4 encodes \(h_{k-1}\) and \(u_{cc}\) into the intermediate representation h, and Eqs. 5 and 6 decode h back into \(h_{k-1}\) and \(u_{cc}\). \(W_k, U_k, b_k, W'_k, b'_k, U'_k\), and \(b''_k\) are parameters learned during training. The whole model is a stack of such layers, and the last hidden layer \(h_d\) is the joint encoding of the input term and its effective context information.

For the whole model, the loss function must include the deviations of both \((x, \hat{x})\) and \((u_{cc}, \hat{u}_{cc})\):

$$\begin{aligned} loss(x, u_{cc}) = \left\| x - \hat{x} \right\| ^2 + \lambda \left\| u_{cc} - \hat{u}_{cc} \right\| ^2 \end{aligned}$$
(7)

where \(\lambda \in [0,1]\) is a weight that controls the effect of the context information during encoding. The optimization objective is to minimize the overall loss:

$$\begin{aligned} \begin{aligned}&\min _{\varTheta } \sum _{i=1}^n loss(x^i, u_{cc}^i), \\ \varTheta =&\{ W_k, W_k', U_k, U_k', b_k, b_k', b_k'' \},~~ k \in 1,2,..., depth \end{aligned} \end{aligned}$$
(8)

We use back-propagation and stochastic gradient descent to learn the parameters during training. Since the autoencoders combined with effective context information form an unsupervised neural network, little annotated data is needed to train the model.
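The following PyTorch sketch puts Eqs. 4 to 8 together. The class names, the sigmoid activation, and the layer-wise reading of the reconstruction loss are our own assumptions rather than the authors' exact implementation, and Adam (Sect. 5.2) stands in for plain SGD in the commented training loop.

```python
import torch
import torch.nn as nn

class ContextLayer(nn.Module):
    """One layer of the autoencoder combined with effective context information (Eqs. 4-6)."""

    def __init__(self, d_in: int, d_ctx: int, d_hid: int):
        super().__init__()
        self.enc_h = nn.Linear(d_in, d_hid)               # W_k, b_k
        self.enc_u = nn.Linear(d_ctx, d_hid, bias=False)  # U_k
        self.dec_h = nn.Linear(d_hid, d_in)               # W'_k, b'_k
        self.dec_u = nn.Linear(d_hid, d_ctx)              # U'_k, b''_k
        self.g = nn.Sigmoid()

    def forward(self, h_prev, u_cc):
        h = self.g(self.enc_h(h_prev) + self.enc_u(u_cc))  # Eq. 4
        h_prev_hat = self.g(self.dec_h(h))                  # Eq. 5
        u_cc_hat = self.g(self.dec_u(h))                    # Eq. 6
        return h, h_prev_hat, u_cc_hat


class ContextAutoencoder(nn.Module):
    """Stacked ContextLayers; the last hidden layer is the joint encoding."""

    def __init__(self, d_word=100, d_ctx=100, d_hid=100, depth=3, lam=0.5):
        super().__init__()
        self.lam = lam
        dims = [d_word] + [d_hid] * depth
        self.layers = nn.ModuleList(
            ContextLayer(dims[k], d_ctx, dims[k + 1]) for k in range(depth))

    def forward(self, x, u_cc):
        h, loss = x, x.new_zeros(())
        for layer in self.layers:
            h_next, h_hat, u_hat = layer(h, u_cc)
            # Eq. 7, applied layer-wise: reconstruction error of this layer's
            # input plus lambda times that of the context encoding.
            loss = loss + ((h - h_hat) ** 2).sum(-1).mean() \
                        + self.lam * ((u_cc - u_hat) ** 2).sum(-1).mean()
            h = h_next
        return h, loss  # h: joint encoding; loss: one summand of Eq. 8


# Training sketch: Adam over batches of term embeddings x and context encodings u_cc.
# model = ContextAutoencoder()
# optimizer = torch.optim.Adam(model.parameters())
# for x, u_cc in loader:
#     encoding, loss = model(x, u_cc)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```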

After training, we use the autoencoders to generate the encoding representations of morphs and target candidates. First, we obtain the initial embedding vectors of the terms and their effective context information, then feed these vectors into the autoencoders and take the last hidden layer as the joint encoding of each morph or target candidate. Finally, we rank the target candidates by the cosine similarity between the joint encodings of the morph and each candidate.
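A small sketch of this final ranking step, assuming the joint encodings are numpy vectors and using our own function names:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two encoding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_candidates(morph_encoding, candidate_encodings):
    """Rank target candidates by cosine similarity to the morph's joint encoding.

    `candidate_encodings` maps each candidate entity to its joint encoding."""
    return sorted(candidate_encodings,
                  key=lambda e: cosine(morph_encoding, candidate_encodings[e]),
                  reverse=True)
```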

5 Experiments and Analysis

5.1 Datasets

We updated the datasets of Huang’s work [3] and added some new morphs and tweets. In total, the datasets include 1,597,416 tweets from Chinese Sina Weibo and 25,003 tweets from Twitter, covering May to June 2012 and September to October 2017. There are 593 morph pairs in the datasets.

5.2 Parameters Setting

To choose appropriate parameters for the model, we randomly selected 50,000 tweets as a validation set for tuning the context window wd and the number of effective context terms K of the PMI context filter, as well as the depth, encoding dimension \(d_2\), and \(\lambda \) of the autoencoders; we chose the parameters that performed best on the validation set. During preprocessing, following Sha’s work [9], we set the time window to one day for Chinese Sina Weibo and three days for Twitter when collecting target candidates. To initialize the term vectors, we use word2vec [7] to generate 100-dimensional word vectors, a popular choice in studies of semantic word embeddings. For the PMI context filter, we set the context window \(wd = 20\) and the number of effective context terms \(K = 100\). For the autoencoders, we set the depth to 3, the encoding dimension \(d_2 = 100\), and \(\lambda = 0.5\). In the experiments, we use Adaptive Moment Estimation (Adam) [4] to speed up the convergence of SGD. We discuss the effects of different parameters below.
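For reference, these settings can be collected into a single configuration; the dictionary and its key names are our own, not part of the original system.

```python
# Parameter settings from Sect. 5.2; key names are ours.
CONFIG = {
    "word_dim": 100,               # word2vec embedding dimension
    "pmi_window": 20,              # context window wd of the PMI filter
    "pmi_top_k": 100,              # number of effective context terms K
    "ae_depth": 3,                 # depth of the autoencoders
    "ae_dim": 100,                 # encoding dimension d_2
    "lambda": 0.5,                 # weight of the context reconstruction term
    "optimizer": "Adam",           # used to speed up SGD convergence
    "weibo_time_window_days": 1,   # candidate collection window for Sina Weibo
    "twitter_time_window_days": 3, # candidate collection window for Twitter
}
```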

5.3 Results

We use Precise@K to evaluate morph resolution, since the result of this task is a ranked sequence. In this paper, \(Precise@K = N_k/Q\): for each morph \(m_i\), let p be the position of its real target \(e_{m_i}\) in the result sequence \(T_{m_i}\); then \(N_k\) is the number of morphs for which \(p \le k\), and Q is the total number of morphs in the test. The performance of our approach and other approaches is presented in Table 2 and Fig. 4, where Huang et al. refers to the work in [3], Zhang et al. to the work in [16], CW to the work by Sha et al. [9], and our approach is marked as AE-ECI. The results show that our approach outperforms the state-of-the-art methods.
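A minimal sketch of this metric, with our own function and argument names:

```python
def precise_at_k(ranked_candidates, real_targets, k):
    """Precise@K = N_k / Q: the fraction of morphs whose real target entity
    appears within the top-k positions of its ranked candidate list.

    `ranked_candidates` maps each morph to its ranked candidate list;
    `real_targets` maps each morph to its real target entity."""
    n_k = sum(1 for morph, ranked in ranked_candidates.items()
              if real_targets[morph] in ranked[:k])
    return n_k / len(ranked_candidates)
```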

Fig. 4. Performance of several approaches on pre@k for morph resolution.

The results show that introducing effective context information improves the accuracy of morph resolution. The current best method, Sha’s work, directly uses word embeddings to compute cosine similarity among words; it only considers the context of the neighboring words of morphs or target entities, and some of these neighboring words carry no semantic link between morphs and target entities. In our approach, we use PMI to select terms that effectively distinguish the characteristics of target entities from non-target entities, so we can resolve morphs more precisely.

Table 2. Performance of several approaches on pre@k for morph resolution.

5.4 Analysis

In this section, we discuss the effects of different parameters.

Window Size and Number of Context Terms. In the PMI context filter, we try different window sizes wd and different numbers of contextual terms K to examine their impact. It turns out that wd and K have little impact on performance; the details are shown in Table 3.

Table 3. Effects of Window size and number of context terms.

Depth and Dimension of Autoencoder. The depth and encoding dimension of the autoencoders also affect performance. We test different combinations of depth and dimension, and the results show that a dimension that is too large or too small hurts performance. A possible reason is that autoencoders with too small a dimension lack representational capacity, while autoencoders with too large a dimension are hard to train. The impact of depth is similar: its effect is not very obvious when the model is shallow, but a model that is too deep performs worse. The details are shown in Table 4.

Table 4. Effects of depth and dimension of autoencoders.

Lambda. \(\lambda \) is the weight that controls the effect of the effective context information during encoding. We test Pre@1 of morph resolution for different values of \(\lambda \); \(\lambda = 0\) means that the effective context information is not added to the model. As shown in Table 5, adding effective context information improves the performance of the model, but a \(\lambda \) that is too large hurts performance.

Table 5. Effects of \(\lambda \).

6 Conclusion

In this paper, we proposed a new approach to morph resolution. By analyzing the features of the contextual terms of morphs and their targets, we extract effective context information based on PMI. We also proposed autoencoders combined with effective context information to obtain the semantic representations of morphs and target entities. Experimental results demonstrate that our approach outperforms the state-of-the-art work on morph resolution. In future work, we will extract topic information and integrate it into our models to improve the accuracy of morph resolution, and explore better ways to fuse the semantic vectors of morphs or target entities with those of their contextual terms.