1 Introduction

Semantic similarity measures between terms or concepts, a task also referred to as synonym identification, are becoming widely used in natural language processing and information retrieval applications such as sense disambiguation [11] and document classification [17]. Given a knowledge source, semantic similarity measures compute the degree to which two words share meaning.

Semantic similarity measures are usually computed with some kind of metric. Some approaches use statistical models [16] or lexical patterns, others represent contexts with the vector space model [3], and still others make use of conceptual hierarchies [2]. Several similarity measures have been developed that rely on the structured knowledge representation offered by ontologies and corpora, which enables the semantic interpretation of terms [6].

Despite the effectiveness of previous studies, semantic similarity measurement in the Chinese medical domain remains a relatively new territory. There are several challenges: (1) Global context information and Chinese linguistic information, which play crucial roles in semantic comprehension, are under-explored in recent work on synonym identification. (2) Low-frequency words in a large-scale corpus are learned less well than high-frequency words, resulting in inaccurate entity knowledge representations.

To alleviate these limitations, we propose a learning method, MedSim, that determines the intended concept associated with an ambiguous word using semantic similarity measures. Specifically, we first explore linguistic and contextual semantic features, so as to augment the information available for low-frequency words and relieve the demand for a corpus with complete knowledge. Then, we alleviate the limitations of plain word information by involving global context extracted from a search engine. Finally, we concatenate all of this information, balancing effectiveness and accuracy, and compare the resulting model with state-of-the-art approaches.

The contributions of this paper can be summarized as follows: (1) We learn linguistic and contextual semantic features, which improve the knowledge representation of low-frequency words; (2) We exploit global context information to capture useful topical information about word pairs, which mitigates the limitations of learning from plain word information alone; (3) Experimental results show that MedSim consistently outperforms state-of-the-art methods.

2 Related Work

Existing methods for automatically identifying synonyms in text can be classified into three groups: supervised, unsupervised, and knowledge-based. Supervised methods [15] use machine learning algorithms to assign concepts/terms to instances containing the ambiguous word. Employing these methods requires training data for each target word to be disambiguated, which is usually labor-intensive and time-consuming. Knowledge-based methods mainly use information from an external knowledge source or a corpus of text [8]. The accuracy of such semantic similarity measures heavily relies on the degree of knowledge completeness and structural sparseness of the adopted knowledge base.

In this study, we apply unsupervised methods, which exploit the distributional characteristics of a corpus to compute semantic similarity. Several studies have explored text corpus information with unsupervised methods. A text corpus typically contains two types of context information: global context and local context. Global context carries topical information, which topic models such as Latent Dirichlet Allocation (LDA) and GloVe [13] can use to discover topic structures in the corpus. Local context can be used to train word embeddings such as NPLM [1] and Word2Vec [10], which capture semantic regularities reflected in the corpus [8]. As a distributed representation of words, the basic idea of word embedding is to convert each word into a vector in a low-dimensional space, so that similar words have vectors with higher relevance to one another.
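
As a minimal sketch of local-context word embeddings (not the exact setup of the cited works), the following example trains a Word2Vec model with gensim (assuming gensim >= 4.0) on a toy corpus and queries the similarity between two embedded words:

```python
# Minimal sketch of local-context word embeddings with gensim's Word2Vec
# (gensim >= 4.0 assumed); the toy corpus and hyperparameters are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["ebola", "virus", "causes", "hemorrhagic", "fever"],
    ["influenza", "virus", "causes", "fever", "and", "cough"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # dimensionality of the embedding space
    window=5,          # size of the local context window
    min_count=1,       # keep every word in this toy example
    sg=1,              # 1 = Skip-gram, 0 = CBOW
)

# Similar words end up with vectors of higher cosine similarity.
print(model.wv.similarity("virus", "fever"))
```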

3 Methodology

In this section, we describe MedSim, an extensible word embedding model for synonym identification of Chinese medical terms; Fig. 1 briefly illustrates the overall architecture. MedSim first transforms the input Chinese word pair into vectors using Word2vec. Then, features are extracted and mapped into low-dimensional vectors by a feature-adjusted mapping function. Finally, word embeddings and feature embeddings are concatenated to identify synonyms. Each component is presented in detail below.

Fig. 1. MedSim architecture.

3.1 Feature Selection

Following our previous work [7], we performed \(2^{13}\) experiments, each with a different subset of the thirteen candidate features, to observe the contribution of each feature and to identify which combinations are most effective. We employ an SVM classifier to classify pairs of words based on the selected features, measuring precision, recall, and F1. A total of 1,526 synonym pairs are included as positive samples; negative pairs are constructed from words drawn from a modern Chinese dictionary. Among the \(2^{13}\) combinations, we select the top 10% (sorted by F1 score) and count the frequency of each feature (see Fig. 2); a higher frequency means the feature performs better across combinations. We find the most useful features for our task to be Chinese cosine similarity, radicals, Normalized Baidu Distance, Chinese character edit distance, and Pinyin edit distance.
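
A simplified sketch of this exhaustive feature-combination search is shown below; the feature matrix, labels, number of features, and SVM settings are placeholders rather than the original experimental configuration:

```python
# Sketch of the 2^k feature-combination search with an SVM classifier.
# X is an (n_samples, k) matrix of pairwise-word features, y the synonym labels;
# both are random placeholders here.
from itertools import combinations

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(0)
k = 5                                 # number of candidate features (13 in the paper)
X = rng.random((200, k))              # placeholder feature matrix
y = rng.integers(0, 2, size=200)      # placeholder synonym / non-synonym labels

results = []
for r in range(1, k + 1):
    for subset in combinations(range(k), r):
        preds = cross_val_predict(SVC(kernel="rbf"), X[:, subset], y, cv=5)
        results.append((subset, f1_score(y, preds)))

# Rank combinations by F1 and count how often each feature appears in the top 10%.
results.sort(key=lambda item: item[1], reverse=True)
top = results[: max(1, len(results) // 10)]
freq = np.zeros(k)
for subset, _ in top:
    freq[list(subset)] += 1
print(freq)
```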

3.2 Feature Embedding

Rule-Based Features. Rule-based features such as radicals and Pinyin can effectively reflect similarities in the written form and pronunciation of Chinese characters. Because these rule-based features encode linguistic information, they reduce the requirement for knowledge completeness in the adopted knowledge base and thus improve synonym identification for low-frequency words.

Fig. 2. Precision, recall and F1 of single features in combinations.

Chinese Character Edit Distance. A shorter edit distance tends to imply synonymy (for instance, a pair of near-identical terms sharing an edit distance of 1). In this paper, we define the relative Chinese character edit distance as:

$$\begin{aligned} EditDist(A,B)=\frac{editDistance(A,B)}{maxLength(A,B)}\qquad \end{aligned}$$
(1)

where editDistance(A, B) is the minimum number of edit operations needed to transform one string into the other, and maxLength(A, B) is the maximum of the lengths of A and B.
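
A straightforward implementation of Eq. (1) might look as follows; the Levenshtein routine is a standard dynamic-programming sketch, not code from the original paper:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute ca -> cb
        prev = curr
    return prev[-1]


def relative_edit_distance(a: str, b: str) -> float:
    """Eq. (1): edit distance normalized by the length of the longer string."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))
```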

Pinyin Edit Distance. Pinyin denotes the pronunciation of Chinese characters; in the medical domain, it can eliminate the differences introduced by transliteration. We extract Pinyin from the standard Xinhua Dictionary and define the relative Pinyin edit distance based on the edit distance:

$$\begin{aligned} pinyinEditDist=\frac{pinyinEditDistance(A,B)}{maxLength(A,B)}\qquad \end{aligned}$$
(2)

For example, two Chinese transliterations of the same term may share the Pinyin “ai’bo’la’bing’du” (Ebola virus), giving \(pinyinEditDist=0\).
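
A sketch of Eq. (2) follows; to keep the snippet self-contained, the pypinyin package stands in for the Xinhua Dictionary pronunciation lookup used in the paper, and the Levenshtein package supplies the raw edit distance:

```python
# Sketch of Eq. (2); pypinyin and Levenshtein are stand-ins, not the paper's pipeline.
from Levenshtein import distance as edit_distance
from pypinyin import lazy_pinyin


def pinyin_edit_distance(a: str, b: str) -> float:
    pa = "'".join(lazy_pinyin(a))   # e.g. "ai'bo'la'bing'du"
    pb = "'".join(lazy_pinyin(b))
    # Normalized by the longer term length, following Eq. (2) as printed.
    return edit_distance(pa, pb) / max(len(a), len(b))


# Illustrative pair: two variant spellings with identical pronunciation map to 0.0.
print(pinyin_edit_distance("埃博拉病毒", "艾博拉病毒"))
```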

Radical Distance. Chinese is a pictographic language, and the radicals of Chinese characters carry semantic meaning of their own. In particular, some radicals appear much more frequently in the medical domain than in other domains, such as those denoting body parts, bacteria-related concepts, and diseases, and thus contribute additional semantic information about a word. Similarly, we define:

$$\begin{aligned} CR=\frac{commonRadicals(A,B)}{maxLength(A,B)}\qquad \end{aligned}$$
(3)

where commonRadicals(A, B) is the number of radicals shared by A and B; the radical information is extracted from the standard Xinhua Dictionary.
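
The radical feature of Eq. (3) can be sketched the same way; the radical table below is a tiny hypothetical stand-in for the Xinhua Dictionary data used in the paper:

```python
# Sketch of Eq. (3). RADICALS is a hypothetical character-to-radical table;
# the paper extracts this information from the standard Xinhua Dictionary.
RADICALS = {"病": "疒", "痛": "疒", "肝": "月", "胆": "月", "菌": "艹"}


def common_radical_ratio(a: str, b: str) -> float:
    radicals_a = {RADICALS[ch] for ch in a if ch in RADICALS}
    radicals_b = {RADICALS[ch] for ch in b if ch in RADICALS}
    # Number of shared radicals, normalized by the longer term length.
    return len(radicals_a & radicals_b) / max(len(a), len(b))
```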

Global Context Semantic Features. The features above are based on the words themselves. Normalized Google Distance (NGD) is a semantic distance derived from the number of hits returned by the Google search engine for a given set of keywords; keywords with similar meanings tend to be “close” in Google distance units. Based on pre-experiments, we adopt a Normalized Baidu Distance (NBD) for Chinese medical text. The NBD between keywords A and B is defined as:

$$\begin{aligned} NBD(A,B)=\frac{max(log f(A),log f(B))-log f(A,B)}{log M-min(log f(A),log f(B))}\qquad \end{aligned}$$
(4)

where M is the total number of web pages indexed by the search engine, f(A) and f(B) are the numbers of hits for A and B respectively, and f(A, B) is the number of pages on which A and B co-occur.

For each combination of corpus and frequency extractor, a relative frequency of words or phrases is defined; in practice, the counts returned by the search engine can be taken as an approximation of relative usage frequency. Thus, NBD approximately captures the semantic relevance of two lexical items, and every web page retrievable by Baidu contributes to the estimate.
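
The sketch below computes Eq. (4) from raw hit counts; querying Baidu and parsing its result counts is outside the scope of this example, so the counts are passed in as plain numbers:

```python
import math


def normalized_baidu_distance(f_a: float, f_b: float, f_ab: float, m: float) -> float:
    """Eq. (4): f_a and f_b are the hit counts of the two terms, f_ab their
    co-occurrence hit count, and m the total number of indexed pages."""
    numerator = max(math.log(f_a), math.log(f_b)) - math.log(f_ab)
    denominator = math.log(m) - min(math.log(f_a), math.log(f_b))
    return numerator / denominator


# Illustrative counts only: terms that co-occur on most of their pages get a small NBD.
print(normalized_baidu_distance(1e6, 8e5, 6e5, 1e10))
```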

3.3 Feature-Adjusted Mapping Function

The feature-adjusted mapping function is designed to adjust the impact of different features. It projects the rule-based and global context semantic features into real-valued vectors; the mapping function \(y=f(x)\) is given by:

$$\begin{aligned} y_i = {\left\{ \begin{array}{ll} \frac{x_i}{thr} &{} \text{ if } x_i < thr\\ \frac{1-2thr+x_i}{2(1-thr)} &{} \text{ if } x_i \ge thr \end{array}\right. } \end{aligned}$$
(5)

where \(x_i\) is a real-valued feature in the range [0, 1] and \(y_i\) is the mapped output in the range [−1, 1]. The threshold thr is set to 0.3.

In this piecewise function, when x is greater than thr, the corresponding y value tends towards 1, indicating that the two words are very similar; when x is smaller than thr, the y value tends towards −1, indicating that the two words have lower similarity. A fuzzy similarity of 0.5 is assigned when x equals thr.

For a word pair (w1, w2) and a real-valued feature x:

$$\begin{aligned} Vec_{w1}=[f(x),f(x),...,f(x)] \end{aligned}$$
(6)
$$\begin{aligned} Vec_{w2}=[1,1,...,1] \end{aligned}$$
(7)

where \(Vec_{w1}\) is the vector obtained by repeating the mapped value \(y=f(x)\) of the real-valued feature x according to (5), and \(Vec_{w2}\) is an all-ones vector of the same length as \(Vec_{w1}\). The length of \(Vec_{w1}\) and \(Vec_{w2}\) can be set manually.
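
A minimal sketch of the mapping in Eq. (5) and the vector construction in Eqs. (6) and (7) follows; the threshold and vector length are illustrative settings:

```python
import numpy as np

THR = 0.3        # threshold thr of Eq. (5); the experiments in Sect. 4 use 0.4
FEAT_DIM = 10    # manually chosen length of each feature vector


def feature_adjusted(x: float, thr: float = THR) -> float:
    """Piecewise mapping y = f(x) of Eq. (5), as printed, for a feature x in [0, 1]."""
    if x < thr:
        return x / thr
    return (1 - 2 * thr + x) / (2 * (1 - thr))


def feature_vectors(x: float, dim: int = FEAT_DIM):
    """Eqs. (6) and (7): repeat the mapped value for w1; use an all-ones vector for w2."""
    vec_w1 = np.full(dim, feature_adjusted(x))
    vec_w2 = np.ones(dim)
    return vec_w1, vec_w2
```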

3.4 Extensible Embedding Model

Different features are transformed into vectors and then appended to the existing word embedding vectors; that is, each selected feature is converted into a feature embedding that extends the word vector along its dimension. For each feature extracted above, our model translates it into a real-valued vector, which is concatenated to the existing word vector representation in the following format:

$$\begin{aligned} word=[vec^w,vec^{f1},vec^{f2},...,vec^{fm}] \end{aligned}$$
(8)

where \(vec^{w}\) is the word vector produced by the existing Word2vec model, and \(vec^{fi}\) is the representation of the ith feature as described above. Hence, MedSim captures linguistic information and global context information on top of the existing word vectors, which focus on local context information. Our model can therefore better handle contextual solidification in medical-domain texts as well as the low accuracy on words that appear infrequently.
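
A sketch of the concatenation in Eq. (8) is shown below; the mapping from Eq. (5) is repeated so that the snippet stands alone, and the word vector and feature values are placeholders:

```python
import numpy as np

THR, FEAT_DIM = 0.3, 10   # threshold and per-feature vector length (illustrative)


def mapped(x: float) -> float:
    # Piecewise mapping of Eq. (5), repeated here so the sketch is self-contained.
    return x / THR if x < THR else (1 - 2 * THR + x) / (2 * (1 - THR))


def extended_embedding(word_vec, feature_values):
    """Eq. (8): concatenate the Word2vec vector with one mapped vector per feature."""
    feature_vecs = [np.full(FEAT_DIM, mapped(x)) for x in feature_values]
    return np.concatenate([np.asarray(word_vec)] + feature_vecs)


word_vec = np.zeros(100)                # placeholder for a Word2vec vector, e.g. model.wv[word]
features = [0.1, 0.9, 0.0, 0.5, 1.0]    # placeholder feature values in [0, 1]
print(extended_embedding(word_vec, features).shape)   # (150,)
```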

4 Experiment

4.1 Dataset and Experimental Setups

Datasets. The proposed approach is evaluated on synonym datasets extracted from five authoritative Chinese medical websites and knowledge bases: A+ Medical Encyclopedia (A+ Medical), Baidu Encyclopedia (Baidu), Hudong Encyclopedia (Hudong), Xywy (a doctor-patient interaction website), and the China Disease Knowledge Total Database (CDD). The five data sources are described in detail in Table 1.

Table 1. Description of the dataset sources.

Implementation Details. We use a subset of the dataset described above, comprising 42,469 annotated synonym pairs. Of these, 7,666 pairs are used for the synonym identification task, after unfamiliar entities are removed. Pre-trained 100-dimensional Word2vec embeddings are used as word embeddings. When features are included as additional information, each feature corresponds to a 10-dimensional vector. The threshold thr of the feature-adjusted mapping function is set to 0.4. In our experiments, the ratio of positive to negative samples is 1:1, and all negative samples are randomly drawn from our dataset.

Evaluation. In this paper, the evaluation metrics are described as follows:

Correlation Coefficient. Spearman’s Rank Correlation Coefficient (\(\rho \)) and the Pearson Correlation Coefficient (r) are used to evaluate the model. These metrics are widely used to assess the consistency between automatically predicted results and manually labeled gold-standard results.

$$\begin{aligned} \rho =1-\frac{6\sum _{i=1}^n{(R_{X_i}-R_{Y_i})^2}}{n(n^2-1)}\qquad \end{aligned}$$
(9)

The Spearman correlation coefficient indicates the relative direction of association between X (the independent variable) and Y (the dependent variable). If Y tends to increase when X increases, the Spearman correlation coefficient is positive; if Y tends to decrease when X increases, it is negative. A Spearman correlation coefficient of zero indicates that Y shows no tendency as X increases.

$$\begin{aligned} r=\frac{\sum _{i=1}^n{(X_i-\bar{X})(Y_i-\bar{Y})}}{\sqrt{\sum _{i=1}^n{(X_i-\bar{X})^2}}\sqrt{\sum _{i=1}^n{(Y_i-\bar{Y})^2}}}\qquad \end{aligned}$$
(10)

The Pearson correlation coefficient ranges from −1 to 1. A value of 1 means that all data points lie on a straight line with Y increasing as X increases; a value of −1 means that Y decreases as X increases; a value of 0 means there is no linear relationship between the two variables.

Precision, Recall and F1 Score. The confusion matrix is widely used in classification problems. According to the combination of a sample’s true category and the category predicted by the model, samples are divided into four cases: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Precision (P) indicates what fraction of the predicted positive samples are truly positive, while Recall (R) indicates how many of the positive examples are correctly predicted.

$$\begin{aligned} P=\frac{TP}{TP+FP}\qquad \end{aligned}$$
(11)
$$\begin{aligned} R=\frac{TP}{TP+FN}\qquad \end{aligned}$$
(12)

Since precision and recall are two measures in tension, we also introduce the F1 score:

$$\begin{aligned} F1=\frac{2*P*R}{P+R}\qquad \end{aligned}$$
(13)

ROC Curve. The Receiver Operating Characteristic (ROC) curve reflects how the classifier responds as the decision threshold varies. Samples are ranked by the model’s predicted score, and for each threshold the false positive rate (FPR) and true positive rate (TPR) are computed and plotted on the horizontal and vertical axes respectively. They are defined as:

$$\begin{aligned} FPR=\frac{FP}{TN+FP}\qquad \end{aligned}$$
(14)
$$\begin{aligned} TPR=\frac{TP}{TP+FN}\qquad \end{aligned}$$
(15)

Accuracy. Accuracy reflects the ability to correctly identify both positive and negative samples and is calculated as:

$$\begin{aligned} Accuracy=\frac{TP+TN}{TP+FN+FP+TN} \end{aligned}$$
(16)
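
As an illustrative sketch, the snippet below computes all of the above metrics with scipy and scikit-learn on placeholder predictions:

```python
# Sketch of the evaluation metrics on placeholder predictions.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_curve)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)     # gold synonym / non-synonym labels (placeholder)
scores = rng.random(100)                  # predicted similarity scores (placeholder)
y_pred = (scores >= 0.5).astype(int)      # thresholded predictions

print("Spearman rho:", spearmanr(scores, y_true)[0])
print("Pearson r:   ", pearsonr(scores, y_true)[0])
print("Precision:   ", precision_score(y_true, y_pred))
print("Recall:      ", recall_score(y_true, y_pred))
print("F1:          ", f1_score(y_true, y_pred))
print("Accuracy:    ", accuracy_score(y_true, y_pred))
fpr, tpr, _ = roc_curve(y_true, scores)   # FPR/TPR points that trace the ROC curve
```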

4.2 Experimental Results

For all implemented methods we apply the same parameter settings to allow a fair comparison. GloVe, Cilin+W2V [12], SVM-uni [9], BiNB [14], STS [4] and SOC-PMI [5] are selected as baselines, and Table 2 compares MedSim with these existing synonym identification methods on our dataset. We observe that MedSim achieves the best results in both correlation coefficient and accuracy. Additionally, an ablation study analyzes the effectiveness of the different feature types in MedSim by discarding them in turn (w/o rule, w/o global); in general, both types of features contribute.

Table 2. Result of synonym identification (using accuracy)

From these results we can see that MedSim achieves the highest accuracy compared with the state-of-the-art methods, and the multiple features added to the model are interpretable.

Table 3. Result of synonym identification (using correlation coefficient)

The correlation results show that: (1) the PMI and PPMI models show poor correlation; (2) comparing the two widely used word vector models, the results obtained with Word2vec are better correlated than those of GloVe; (3) our base models, i.e., adding linguistic information and global context information separately, outperform the Word2vec baseline, indicating that our method helps on low-frequency words; (4) the use of a search engine improves the performance of our model. This is in line with our expectation, since search engines provide background information and hidden relations beyond the local context, which play crucial roles in human text comprehension. Therefore, our model shows robust superiority over its competitors (Table 3).

4.3 Analysis

Effect of Adding Rule-Based Features. In this section, we examine the distribution of similarity scores for positive and negative samples after the linguistic information is incorporated. In Fig. 3, the left panel shows the results for positive samples and the right panel for negative samples. Compared with the red curve, adding linguistic information leads to a noticeable shift in the similarity distribution of the blue curve, indicating improved synonym identification. However, homophones with different meanings cannot be well distinguished by the local rule-based features alone (i.e., Pinyin), so the model (Word2vec+linguistic) does not perform well in certain regions (e.g., where the x-axis value is below 0.0).

Fig. 3. Effect of adding rule-based features.

Effect of Adding Global Context Features. On top of the rule-based feature embeddings, the global feature Normalized Baidu Distance is added to verify the effect of global context semantic features.

Fig. 4. Effect of adding global context semantic features.

Similarly, we can see from Fig. 4 that: (1) for the blue curve, the peak shifts to the right and clearly exceeds the given threshold for the positive samples, while it shifts to the left for the negative ones, indicating that words are better distinguished; (2) in particular, the left panel shows that adding global context semantic information resolves the problem observed in Fig. 3 when linguistic information is used alone.

Fig. 5. Comparison of PR and ROC curves in the ablation study.

The model presented in this paper is labeled MedSim (MedSimc for the CBOW-based variant and MedSims for the Skip-gram-based variant). The experimental results above show that adding global context information on top of linguistic information improves the performance of our synonym identification model. With only rule-based features added, precision is 0.79246 and recall is 0.77602; adding global context information improves recall without reducing precision. Global context information enables the model to make better use of contextual semantics and to better handle terms with different written forms or pronunciations but the same meaning; it thereby improves precision and complements the linguistic information (Fig. 5).

5 Conclusion

Motivated by the need to identify synonyms accurately and effectively while reducing the requirement for large-scale, knowledge-complete databases, we presented a novel approach that measures word similarity by capturing fine-grained linguistic and global context information. We implemented our model using feature embeddings; our experiments show that these features improve the performance of existing methods, especially for low-frequency words, demonstrating that these sources of information are complementary. Future work will include entity linking based on the identified synonyms, contributing to ontology completion.