
End-to-end neural machine translation (NMT) is a machine translation approach proposed in recent years [1,2,3,4]. Most NMT systems are based on the encoder-decoder framework: the encoder encodes the source sentence into a vector, and the decoder decodes that vector into the target sentence. Compared with traditional statistical machine translation (SMT), NMT has many advantages and has shown strong performance on many translation tasks.

However, NMT still suffers from unknown words, a problem caused by its limited vocabulary size. To control the time and memory costs of the model, NMT usually uses small vocabularies on both the source side and the target side [5]. Words not in the vocabulary are unknown words, and each is replaced by an "UNK" symbol. A feasible way to address this problem is to find in-vocabulary substitutes for the unknown words. Li et al. proposed a replacement method based on word vector similarity [5]: unknown words are replaced by in-vocabulary synonyms selected using the cosine distance between word vectors together with a language model. However, this method has some unavoidable problems. First, the vectors of rare words are difficult to train. Second, a single trained word vector cannot express the various senses of a polysemous word, so it cannot adapt the replacement of polysemous words to different contexts.

To solve these problems, this paper proposes an unknown words processing method based on HowNet. The method uses HowNet's concepts and sememes together with language models to calculate the semantic similarity between words and to select the best in-vocabulary words to replace the unknown words.

Experiments on English-to-Chinese translation tasks demonstrate that the proposed method achieves an improvement of 2.89 BLEU points on average over the baseline system, and also outperforms the traditional method based on word vector similarity by nearly 0.7 BLEU points.

The main contributions of this paper are as follows:

  • An external bilingual semantic dictionary is integrated into NMT to address the problem of unknown words.

  • The semantic concepts and sememes in HowNet are used to select replacement words, which handles rare words and polysemous words better.

  • A similarity model integrating language models and HowNet is proposed. It ensures not only that the replacement words are semantically close to the unknown words, but also that the semantic completeness of the source sentence is preserved as much as possible.

1 NMT and the Problem of Unknown Words

In this section, we introduce NMT and the impact of unknown words on it.

1.1 Neural Machine Translation with Attention

Most of the proposed NMT systems are based on the encoder-decoder framework with an attention mechanism, which learns to soft-align and translate jointly [4].

The encoder consists of a bidirectional recurrent neural network (Bi-RNN), which reads a source sequence \( X = (x_1, \ldots, x_t) \) and generates a sequence of forward hidden states \( (\vec{h}_1, \ldots, \vec{h}_t) \) and a sequence of backward hidden states \( (\overleftarrow{h}_1, \ldots, \overleftarrow{h}_t) \). We obtain the annotation \( h_i \) for each source word \( x_i \) by concatenating the forward hidden state \( \vec{h}_i \) and the backward hidden state \( \overleftarrow{h}_i \).

The decoder consists of a recurrent neural network (RNN), an attention network, and a logistic regression network. At each time step \( i \), the RNN generates the hidden state \( s_i \) based on the previous hidden state \( s_{i-1} \), the previously predicted word \( y_{i-1} \), and the context vector \( c_i \), which the attention network computes as a weighted sum of the source annotations. The logistic regression network then predicts the target word \( y_i \).
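For concreteness, the following NumPy sketch shows one decoder step with additive attention. The shapes, parameter names, and the plain tanh cell standing in for a GRU are our illustrative assumptions, not the exact parameterization of the systems cited above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(s_prev, y_prev, H, Wa, Ua, va, W):
    """One decoder step with additive attention (illustrative shapes).

    s_prev: previous hidden state s_{i-1}, shape (d,)
    y_prev: embedding of the previously predicted word y_{i-1}, shape (e,)
    H:      source annotations h_1..h_T (fwd/bwd concatenation), shape (T, 2d)
    """
    # alignment scores: score_j = va . tanh(Wa s_{i-1} + Ua h_j)
    scores = np.tanh(s_prev @ Wa + H @ Ua) @ va  # (T,)
    alpha = softmax(scores)                      # attention weights over source words
    c = alpha @ H                                # context c_i: weighted sum of annotations
    # a plain tanh cell stands in for the GRU used in practice
    s = np.tanh(np.concatenate([s_prev, y_prev, c]) @ W)
    return s, c, alpha
```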

1.2 The Problem of Unknown Words

When predicting the target word at each time step, the model must produce a probability for every word in the target vocabulary. The output dimension of the logistic regression network therefore equals the target vocabulary size, and the total computational complexity grows almost proportionally to the vocabulary size. Training the model with the full vocabulary is thus infeasible, which leads to the problem of unknown words caused by the limited vocabulary size.

In an NMT system, unknown words mainly cause two problems. First, the NMT model can hardly learn good representations and appropriate translations for unknown words, so the quality of the corresponding network parameters is poor. Second, the presence of unknown words increases the ambiguity of the source sentence, which hurts the accuracy of the attention network and the quality of the translation result.

2 Our Method

This paper proposes an unknown words processing method with HowNet. The framework of our method is shown in Fig. 1.

Fig. 1. Framework of our unknown words processing method

In the training phase, we first learn a similarity model from a monolingual corpus and HowNet, which is used to evaluate the semantic similarity between words. At the same time, we align the bilingual corpus and extract a bilingual dictionary. We then use the similarity model and the bilingual dictionary to replace unknown words on both the source side and the target side. Finally, we train an NMT model on the replaced bilingual corpus.

In the testing phase, we first replace the unknown words using the similarity model. After replacement, we use the trained NMT model to translate the replaced input sentences. During translation, the alignment probabilities of each target word are obtained from the attention network. Finally, the translations of the replaced words are restored using the alignment probabilities and the bilingual dictionary.
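As a compact summary, here is a sketch of this test-time flow. The three callables are hypothetical stand-ins for the components described above, not a real API:

```python
def translate_with_replacement(sentence, replace_unknowns, nmt_translate, restore):
    """Test-time flow of the proposed method (hypothetical helper names)."""
    # 1. Replace out-of-vocabulary words with similar in-vocabulary words,
    #    recording which positions were replaced.
    replaced_sentence, record = replace_unknowns(sentence)
    # 2. Translate; the attention network also yields, for each target word,
    #    the source position it most strongly attends to.
    target, alignments = nmt_translate(replaced_sentence)
    # 3. Map target words aligned to replaced positions back to the
    #    translations of the original source words.
    return restore(target, alignments, record)
```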

This section mainly introduces HowNet and the details of our proposed method.

2.1 HowNet

HowNet is a widely used computable Chinese-English semantic dictionary. In HowNet, the formal description of words is organized in three layers: "word", "concept", and "sememe". Words are expressed by concepts, and concepts are defined by sememes, which are carefully designed by the author of HowNet. That is to say, all concepts are made up of different combinations of sememes.

"Concept" is a description of lexical semantics; each word can be expressed as several concepts. A concept is described in a knowledge-expressing language composed of sememes. "Sememe" is the basic semantic unit, and all sememes are organized into a hierarchical tree by Hypernym-Hyponym relations.

We use HowNet 2008 in our experiments, which contains 1,700 sememes, 28,925 concepts, 96,744 Chinese words, and 93,467 English words.

2.2 Similarity Model

The replacement words should not only be semantically close to the unknown words, but should also preserve the semantics of the original sentence as much as possible. Therefore, this paper defines a semantic similarity model that integrates language models and HowNet to compute the semantic similarity between in-vocabulary words and unknown words, and then selects the best replacement words.

We train a 3-gram language model in our experiments. For an unknown word \( w_i \) and a candidate replacement word \( w_i^{\prime} \), where \( i \) denotes the position of the word in the sentence, the score under the 3-gram language model is defined as formula 1:

$$ Score_{3\text{-}gram}(w_i^{\prime}, w_i) = \frac{p(w_i^{\prime} \mid w_{i-1}, w_{i-2}) + p(w_{i+1} \mid w_i^{\prime}, w_{i-1}) + p(w_{i+2} \mid w_{i+1}, w_i^{\prime})}{3} $$
(1)
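A minimal sketch of this score, assuming a hypothetical `lm_prob(w, (u, v))` that returns the smoothed probability \( p(w \mid u, v) \) and a sentence padded with boundary symbols:

```python
def score_3gram(sent, i, w_new, lm_prob):
    """Average trigram probability around position i after substituting
    w_new for sent[i] (formula 1). Assumes sent is padded so that
    positions i-2 .. i+2 exist."""
    w = list(sent)
    w[i] = w_new
    return (lm_prob(w[i],     (w[i - 2], w[i - 1])) +   # p(w'_i | w_{i-2}, w_{i-1})
            lm_prob(w[i + 1], (w[i - 1], w[i])) +       # p(w_{i+1} | w_{i-1}, w'_i)
            lm_prob(w[i + 2], (w[i], w[i + 1]))) / 3.0  # p(w_{i+2} | w'_i, w_{i+1})
```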

We use the method proposed by Liu and Li [6] to calculate the semantic similarity of the word pair \( (w_i^{\prime}, w_i) \) in HowNet, which is defined as:

$$ Sim_{HowNet}(w_1, w_2) = \max_{i=1 \ldots n,\; j=1 \ldots m} \sum_{I=1}^{4} \beta_I \prod_{J=1}^{I} Sim_J(S_{1i}, S_{2j}) $$
(2)

where \( S_{1i} \) and \( S_{2j} \) range over all concepts of \( w_1 \) and \( w_2 \) respectively, \( Sim_J(S_{1i}, S_{2j}) \) are the partial similarities, and \( \beta_I \) are adjustable parameters. Details of this formula are described in reference [6].

This similarity calculation defines the similarity between two words as the maximum similarity over all of their corresponding concepts, converting word similarity into concept similarity. Concept similarity is then decomposed, through the semantic expression of each concept, into a combination of sememe similarities. The sememe similarity is computed from the semantic distance between sememes, which is obtained from the Hypernym-Hyponym relations.
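To give an intuition for the primitive sememe similarity: in Liu and Li's method it decays with the path distance \( d \) between two sememes in the hypernym tree, roughly as \( \alpha/(d+\alpha) \). A sketch under that assumption, with each sememe node holding a `parent` pointer (our representation, not HowNet's file format):

```python
def path_to_root(node):
    """Chain of nodes from a sememe up to the root of the hypernym tree."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def sememe_similarity(p1, p2, alpha=1.6):
    """Distance-based sememe similarity alpha / (d + alpha), where d is the
    path length between p1 and p2 through their lowest common ancestor."""
    up1 = {n: i for i, n in enumerate(path_to_root(p1))}
    for j, n in enumerate(path_to_root(p2)):
        if n in up1:                     # lowest common ancestor found
            return alpha / (up1[n] + j + alpha)
    return 0.0                           # different trees: no similarity
```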

This method has two advantages. First, concepts can express the various senses of polysemous words. Second, as long as the unknown words are included in HowNet, the gap between rare words and common words is effectively eliminated.

The semantic similarity of the word pair \( (w_{i}^{\prime } ,w_{i} ) \) is finally defined as formula 3:

$$ Sim(w_i^{\prime}, w_i) = \sqrt{Score_{3\text{-}gram}(w_i^{\prime}, w_i) \cdot Sim_{HowNet}(w_i^{\prime}, w_i)} $$
(3)
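The geometric mean makes both factors necessary: a candidate that is fluent but semantically distant, or semantically close but disfluent, scores low. A sketch reusing `score_3gram` from above, with `sim_hownet` passed in as a callable:

```python
import math

def similarity(sent, i, w_new, lm_prob, sim_hownet):
    """Combined similarity of formula 3: geometric mean of the trigram
    fluency score and the HowNet semantic similarity."""
    return math.sqrt(score_3gram(sent, i, w_new, lm_prob) *
                     sim_hownet(w_new, sent[i]))
```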

In the word alignment phase, we keep only the aligned pair with the highest probability for each word, so the aligned bilingual corpus contains only one-to-one or one-to-null mappings. Therefore, when replacing aligned word pairs that contain unknown words, we need to handle three cases:

  • Both the source side and the target side are unknown words: in this case, we must consider bilingual word similarity; that is, only a translation pair that is similar to the original pair on both the source side and the target side will be selected. For an aligned word pair \( (s_i, t_j) \), the score for replacing it with an alternative pair \( (s_i^{\prime}, t_j^{\prime}) \) is calculated as formula 4:

    $$ Score = \frac{Sim(s_i, s_i^{\prime}) + Sim(t_j, t_j^{\prime})}{2} $$
    (4)
  • One side is an unknown word and the other side is an in-vocabulary word: in this case, we replace only the unknown word \( s_i \), but we should still consider bilingual word similarity: a word is selected only if it is similar to the original word and its translation is similar to the aligned word. We therefore first obtain the translation \( T_{s_i^{\prime}} \) of the alternative replacement word \( s_i^{\prime} \) from the bilingual dictionary, and calculate the replacement score with formula 5:

    $$ Score = \frac{Sim(s_i, s_i^{\prime}) + Sim(t_j, T_{s_i^{\prime}})}{2} $$
    (5)
  • One side is an unknown word and the other side is null: in this case, we consider only monolingual word similarity, so we simply take the similarity between the unknown word \( w_i \) and its alternative replacement word \( w_i^{\prime} \) as the replacement score:

    $$ Score = Sim(w_i, w_i^{\prime}) $$
    (6)

Finally, the in-vocabulary word or word pair with the highest replacement score is chosen to replace the unknown words.
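The three cases can be folded into one scoring routine. The sketch below uses `None` to mark "no replacement needed" or a null alignment; the callables `sim` and `trans_of` (dictionary lookup) are hypothetical stand-ins:

```python
def replacement_score(s, t, s_new, t_new, sim, trans_of):
    """Score a candidate replacement for an aligned pair (formulas 4-6).
    s_new / t_new are None when that side needs no replacement;
    t is None when the source word is aligned to null."""
    if s_new is not None and t_new is not None:
        # both sides unknown: average of the two monolingual similarities
        return (sim(s, s_new) + sim(t, t_new)) / 2.0
    if s_new is not None and t is not None:
        # source unknown, target in-vocabulary: compare the aligned target
        # word with the dictionary translation of the candidate
        return (sim(s, s_new) + sim(t, trans_of(s_new))) / 2.0
    # unknown word aligned to null: monolingual similarity only
    return sim(s, s_new)
```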

2.3 Restore Unknown Words

The NMT model is a sequence-to-sequence model, so we can only find the most likely alignment through the attention network. However, the attention network in the NMT model is very unstable. To reduce the effect of alignment errors, a judging step is added: we align the words in the training corpus with GIZA++ [7] to obtain a bilingual dictionary, which contains all words in the training corpus and their translations. For a word \( t_i \) in the output sentence, if \( t_i \) aligns to a replaced word \( s_j \), the bilingual dictionary is used to judge the correctness of the alignment: if the word pair \( (s_j, t_i) \) is in the bilingual dictionary, the alignment is considered correct, and \( t_i \) is replaced with the translation of the original source word; otherwise \( t_i \) is kept in the output sentence.
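A sketch of this restoration step. Here `replaced` records, for each replaced source position, the original word and the in-vocabulary word standing in for it; `dict_pairs` is the GIZA++-derived dictionary viewed as a set of (source, target) pairs. All names are ours:

```python
def restore(output, alignments, replaced, bilingual_dict, dict_pairs):
    """Restore translations of replaced source words in the NMT output.

    output:     target words produced by the NMT model
    alignments: alignments[i] = source position that target word i
                most strongly attends to
    replaced:   {src_pos: (original_word, replacement_word)}
    """
    restored = list(output)
    for i, t in enumerate(output):
        j = alignments[i]
        if j in replaced:
            original, replacement = replaced[j]
            # the alignment is judged correct only if the dictionary
            # confirms that the replacement word translates to t
            if (replacement, t) in dict_pairs:
                restored[i] = bilingual_dict[original]
    return restored
```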

3 Experiments

Since HowNet is a Chinese and English semantic dictionary, we verify our method on the English to Chinese translation task.

3.1 Settings

The bilingual data used to train the NMT model is selected from the CWMT2015 English-Chinese news corpus and contains 1.6 million sentence pairs. The development set and test set are officially provided by CWMT2015, each with 1,000 sentences. To shorten the training time, sentence pairs longer than 50 words on either the source side or the target side are filtered out. Word alignment is also carried out on the training set. The language models and word vectors are trained on monolingual data containing 5 million sentences selected from the CWMT2015 English-Chinese news corpus, for both the source language and the target language.

We use the BLEU score [8] to evaluate the translation results.

3.2 Training Details

The hyper-parameters of our NMT system are as follows: the vocabulary size is limited to 20k on the source side and 30k on the target side; the number of hidden units is 512 for both the encoder and the decoder; the word embedding dimension of the source and target words is 512. The parameters are updated with the Adadelta algorithm [9]. Dropout [10] is used at the readout layer, with the dropout rate set to 0.5.
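For reference, the settings above collected in one place (the key names are ours, not those of any particular toolkit):

```python
CONFIG = {
    "src_vocab_size": 20_000,
    "tgt_vocab_size": 30_000,
    "hidden_units":   512,         # encoder and decoder
    "embedding_dim":  512,         # source and target word embeddings
    "optimizer":      "adadelta",  # parameter updates [9]
    "dropout":        0.5,         # applied at the readout layer [10]
    "max_sent_len":   50,          # longer pairs filtered out (Sect. 3.1)
}
```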

3.3 Comparative Experiments and Main Results

There are five systems in our comparative experiments:

  1. Moses [11]: an open-source phrase-based SMT system with the default configuration.

  2. RNNSearch: our baseline NMT system with an improved attention mechanism [12].

  3. PosUnk: the baseline NMT system extended with the unknown words processing method proposed by Luong et al. [13].

  4. w2v&lm_restore: the baseline NMT system extended with the method of Li et al. [5], which replaces unknown words based on word vectors and language models. The word vectors are trained with the word2vec toolkit [14], and the 3-gram language models with modified Kneser-Ney smoothing are trained with SRILM [15].

  5. hn&lm_restore: the baseline NMT system extended with our method, which replaces unknown words using HowNet and the language models. The language models are the same as those used in system 4.

The main experimental results are shown in Table 1.

Table 1. BLEU scores (%) of different systems

As we can see, our system (hn&lm_restore) performs poorly on this data: it slightly improves over the baseline NMT system, but it is worse than the other unknown words processing methods. The reason is that more than two-thirds of the unknown words in this data are not covered by HowNet, and such unknown words cannot be replaced by our method. To evaluate our method under better HowNet coverage, we select another data set from the CWMT2015 English-Chinese news corpus in which most of the unknown words are covered by HowNet; we refer to this data as the HowNet adapted data. It includes a training set of 1 million sentence pairs, a development set of 1,000 sentences, and a test set of 1,000 sentences. The experimental results on the HowNet adapted data are shown in Table 2.

Table 2. BLEU scores (%) of different systems on HowNet adapted data

On the HowNet adapted data, our system (hn&lm_restore) outperforms the baseline system (RNNSearch) by 2.89 BLEU points on average. In addition, it surpasses the NMT system with a simple unknown-word processing module (PosUnk) by 1.25 BLEU points, and it significantly improves over the traditional method (w2v&lm_restore) by 0.7 BLEU points.

These results show that our method is effective on the HowNet adapted corpus. As HowNet continues to expand and improve, our approach will become useful on more corpora.

3.4 Comparison of Translating Details

Here we compare the translation details of our system with those of the other systems, focusing on how unknown words are translated. Translation examples are shown in Table 3.

Table 3. Translation instances table

The main advantage of our system is that the replacement words it selects are more appropriate. In eg1 and eg2, the unknown words are a morphologically derived word (amazingly) and a compound word (never-ending). These unknown words break the semantic continuity of the source sentences. Worse, they are rare words, so their word vectors are not well trained. As a result, traditional replacement methods change the original meaning of the source sentences and affect the subsequent translations, resulting in over-translation or disfluent translation.

However, these rare words are covered by HowNet. Our method finds more appropriate replacement words, better preserves the original meaning of the source sentences, and has less impact on the subsequent translations. After restoration, we obtain translations that are very close to the references.

Although our method can handle most unknown words, some remain unsolved. In eg3, the number 110,000 is not contained in HowNet, so our method cannot deal with items of this kind; for such cases, we can only replace the unknown words in post-processing.

4 Conclusion and Future Work

This paper proposes an unknown words processing method for NMT that integrates the concepts and sememes of HowNet with language models. The method has advantages in dealing with rare words and polysemous words: it not only improves the translation of unknown words in NMT, but also preserves the semantic completeness of the original sentence. Experiments on English-to-Chinese translation show that our method achieves a significant improvement over the baseline NMT system and also outperforms traditional unknown words processing methods.

Our future work has two aspects. First, our method relies on the coverage of HowNet over the corpus, and improving this coverage is left as future work. Second, the replacement method proposed in this paper is limited to the word level; we plan to extend it to the phrase level.