
End-to-end neural machine translation (NMT) is a machine translation approach proposed in recent years [1,2,3,4]. Most NMT systems are based on the encoder-decoder framework: the encoder encodes the source sentence into a vector, and the decoder decodes that vector into the target sentence. Compared with traditional statistical machine translation (SMT), NMT has many advantages and has shown strong performance on many translation tasks.

However, NMT still suffers from unknown words, a problem caused by its limited vocabulary size. To control the time and memory costs of the model, NMT usually uses small vocabularies on both the source side and the target side [5]. Words not in the vocabulary are unknown words, and each is replaced by an "UNK" symbol. A feasible way to address this problem is to find in-vocabulary substitutes for the unknown words. Li et al. proposed a replacement method based on word vector similarity [5]: unknown words are replaced by in-vocabulary synonyms selected using the cosine distance between word vectors together with a language model. However, this method has some unavoidable problems. First, the vectors of rare words are difficult to train. Second, a single trained word vector cannot express the various senses of a polysemous word, so it cannot adapt the replacement of polysemous words to different contexts.

To solve these problems, this paper proposes an unknown words processing method based on HowNet. The method uses HowNet's concepts and sememes together with language models to calculate the semantic similarity between words and to select the best in-vocabulary words to replace the unknown words.

Experiments on English-to-Chinese translation tasks demonstrate that the proposed method achieves an improvement of 2.89 BLEU points on average over the baseline system, and also outperforms the traditional method based on word vector similarity by nearly 0.7 BLEU points.

The main contributions of this paper are as follows:

  • An external bilingual semantic dictionary is integrated into NMT to address the problem of unknown words.

  • The semantic concepts and sememes in HowNet are used to select replacement words, which handles rare words and polysemous words better.

  • A similarity model integrating language models and HowNet is proposed. It ensures not only that the replacement words are semantically close to the unknown words, but also that the semantic completeness of the source sentence is preserved as much as possible.

1 NMT and the Problem of Unknown Words

In this section, we introduce NMT and the impact of unknown words on it.

1.1 Neural Machine Translation with Attention

Most of the proposed NMT systems are based on the encoder-decoder framework with an attention mechanism, which learns to soft-align and translate jointly [4].

The encoder consists of a bidirectional recurrent neural network (Bi-RNN), which reads a source sequence \( X = (x_1, \ldots, x_t) \) and generates a sequence of forward hidden states \( (\vec{h}_1, \ldots, \vec{h}_t) \) and a sequence of backward hidden states \( (\overleftarrow{h}_1, \ldots, \overleftarrow{h}_t) \). We obtain the annotation \( h_i \) for each source word \( x_i \) by concatenating the forward hidden state \( \vec{h}_i \) and the backward hidden state \( \overleftarrow{h}_i \).

The decoder consists of a recurrent neural network (RNN), an attention network, and a logistic regression network. At each time step \( i \), the RNN generates the hidden state \( s_i \) based on the previous hidden state \( s_{i-1} \), the previously predicted word \( y_{i-1} \), and the context vector \( c_i \), which the attention network computes as a weighted sum of the source annotations. The logistic regression network then predicts the target word \( y_i \).
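For concreteness, the following NumPy sketch shows one decoder step with additive attention. The shapes, parameter names, and the plain tanh cell standing in for a GRU are our illustrative assumptions, not the exact parameterization of the systems cited above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(s_prev, y_prev, H, Wa, Ua, va, W):
    """One decoder step with additive attention (illustrative shapes).

    s_prev: previous hidden state s_{i-1}, shape (d,)
    y_prev: embedding of the previously predicted word y_{i-1}, shape (e,)
    H:      source annotations h_1..h_T (fwd/bwd concatenation), shape (T, 2d)
    """
    # alignment scores: score_j = va . tanh(Wa s_{i-1} + Ua h_j)
    scores = np.tanh(s_prev @ Wa + H @ Ua) @ va  # (T,)
    alpha = softmax(scores)                      # attention weights over source words
    c = alpha @ H                                # context c_i: weighted sum of annotations
    # a plain tanh cell stands in for the GRU used in practice
    s = np.tanh(np.concatenate([s_prev, y_prev, c]) @ W)
    return s, c, alpha
```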

1.2 The Problem of Unknown Words

When predicting the target word at each time step, the model must produce a probability for every word in the target vocabulary. The output dimension of the logistic regression network therefore equals the target vocabulary size, and the total computational complexity grows almost proportionally to the vocabulary size. Training the model with the full vocabulary is thus infeasible, which leads to the problem of unknown words caused by the limited vocabulary size.

In an NMT system, unknown words mainly cause two problems. First, the NMT model can hardly learn good representations and appropriate translations for unknown words, so the quality of the corresponding network parameters is poor. Second, the presence of unknown words increases the ambiguity of the source sentence, which hurts the accuracy of the attention network and the quality of the translation result.

2 Our Method

This paper proposes an unknown words processing method with HowNet. The framework of our method is shown in Fig. 1.

Fig. 1. Framework of our unknown words processing method

In the training phase, we first learn a similarity model from a monolingual corpus and HowNet, which is used to evaluate the semantic similarity between words. At the same time, we align the bilingual corpus and extract a bilingual dictionary. We then use the similarity model and the bilingual dictionary to replace unknown words on both the source side and the target side. Finally, we train an NMT model on the replaced bilingual corpus.

In the testing phase, we first replace the unknown words using the similarity model. After replacement, we use the trained NMT model to translate the replaced input sentences. During translation, the alignment probabilities of each target word are obtained from the attention network. Finally, the translations of the replaced words are restored using the alignment probabilities and the bilingual dictionary.
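As a compact summary, here is a sketch of this test-time flow. The three callables are hypothetical stand-ins for the components described above, not a real API:

```python
def translate_with_replacement(sentence, replace_unknowns, nmt_translate, restore):
    """Test-time flow of the proposed method (hypothetical helper names)."""
    # 1. Replace out-of-vocabulary words with similar in-vocabulary words,
    #    recording which positions were replaced.
    replaced_sentence, record = replace_unknowns(sentence)
    # 2. Translate; the attention network also yields, for each target word,
    #    the source position it most strongly attends to.
    target, alignments = nmt_translate(replaced_sentence)
    # 3. Map target words aligned to replaced positions back to the
    #    translations of the original source words.
    return restore(target, alignments, record)
```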

This section mainly introduces HowNet and the details of our proposed method.

2.1 HowNet

HowNet is a widely used computable Chinese-English semantic dictionary. In HowNet, the formal description of words is organized in three layers: "word", "concept", and "sememe". Words are expressed by concepts, and concepts are defined by sememes, which are carefully designed by the author of HowNet. That is to say, all concepts are made up of different combinations of sememes.

"Concept" is a description of lexical semantics; each word can be expressed as several concepts. A concept is described in a knowledge-expressing language composed of sememes. "Sememe" is the basic semantic unit, and all sememes are organized into a hierarchical tree by Hypernym-Hyponym relations.

We use HowNet 2008 in our experiments, which contains 1,700 sememes, 28,925 concepts, 96,744 Chinese words, and 93,467 English words.

2.2 Similarity Model

The replacement words should not only be semantically close to the unknown words, but should also preserve the semantics of the original sentence as much as possible. Therefore, this paper defines a semantic similarity model that integrates language models and HowNet to compute the semantic similarity between in-vocabulary words and unknown words, and then selects the best replacement words.

We train a 3-gram language model in our experiments. For an unknown word \( w_i \) and a candidate replacement word \( w_i^{\prime} \), where \( i \) denotes the position of the word in the sentence, the score under the 3-gram language model is defined as formula 1:

$$ Score_{3\text{-}gram}(w_i^{\prime}, w_i) = \frac{p(w_i^{\prime} \mid w_{i-1}, w_{i-2}) + p(w_{i+1} \mid w_i^{\prime}, w_{i-1}) + p(w_{i+2} \mid w_{i+1}, w_i^{\prime})}{3} $$
(1)
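A minimal sketch of this score, assuming a hypothetical `lm_prob(w, (u, v))` that returns the smoothed probability \( p(w \mid u, v) \) and a sentence padded with boundary symbols:

```python
def score_3gram(sent, i, w_new, lm_prob):
    """Average trigram probability around position i after substituting
    w_new for sent[i] (formula 1). Assumes sent is padded so that
    positions i-2 .. i+2 exist."""
    w = list(sent)
    w[i] = w_new
    return (lm_prob(w[i],     (w[i - 2], w[i - 1])) +   # p(w'_i | w_{i-2}, w_{i-1})
            lm_prob(w[i + 1], (w[i - 1], w[i])) +       # p(w_{i+1} | w_{i-1}, w'_i)
            lm_prob(w[i + 2], (w[i], w[i + 1]))) / 3.0  # p(w_{i+2} | w'_i, w_{i+1})
```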

We use the method proposed by Liu and Li [6] to calculate the semantic similarity of the word pair \( (w_i^{\prime}, w_i) \) in HowNet, which is defined as:

$$ Sim_{HowNet}(w_1, w_2) = \max_{i=1 \ldots n,\; j=1 \ldots m} \sum_{I=1}^{4} \beta_I \prod_{J=1}^{I} Sim_J(S_{1i}, S_{2j}) $$
(2)

where \( S_{1i} \) and \( S_{2j} \) range over all concepts of \( w_1 \) and \( w_2 \) respectively, \( Sim_J(S_{1i}, S_{2j}) \) are the partial similarities, and \( \beta_I \) are adjustable parameters. Details of this formula are described in reference [6].

This similarity calculation defines the similarity between two words as the maximum similarity over all of their corresponding concepts, converting word similarity into concept similarity. Concept similarity is then decomposed, through the semantic expression of each concept, into a combination of sememe similarities. The sememe similarity is computed from the semantic distance between sememes, which is obtained from the Hypernym-Hyponym relations.
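To give an intuition for the primitive sememe similarity: in Liu and Li's method it decays with the path distance \( d \) between two sememes in the hypernym tree, roughly as \( \alpha/(d+\alpha) \). A sketch under that assumption, with each sememe node holding a `parent` pointer (our representation, not HowNet's file format):

```python
def path_to_root(node):
    """Chain of nodes from a sememe up to the root of the hypernym tree."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def sememe_similarity(p1, p2, alpha=1.6):
    """Distance-based sememe similarity alpha / (d + alpha), where d is the
    path length between p1 and p2 through their lowest common ancestor."""
    up1 = {n: i for i, n in enumerate(path_to_root(p1))}
    for j, n in enumerate(path_to_root(p2)):
        if n in up1:                     # lowest common ancestor found
            return alpha / (up1[n] + j + alpha)
    return 0.0                           # different trees: no similarity
```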

This method has two advantages. First, concepts can express the various senses of polysemous words. Second, as long as the unknown words are included in HowNet, the gap between rare words and common words is effectively eliminated.

The semantic similarity of the word pair \( (w_{i}^{\prime } ,w_{i} ) \) is finally defined as formula 3:

$$ Sim(w_i^{\prime}, w_i) = \sqrt{Score_{3\text{-}gram}(w_i^{\prime}, w_i) \cdot Sim_{HowNet}(w_i^{\prime}, w_i)} $$
(3)
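The geometric mean makes both factors necessary: a candidate that is fluent but semantically distant, or semantically close but disfluent, scores low. A sketch reusing `score_3gram` from above, with `sim_hownet` passed in as a callable:

```python
import math

def similarity(sent, i, w_new, lm_prob, sim_hownet):
    """Combined similarity of formula 3: geometric mean of the trigram
    fluency score and the HowNet semantic similarity."""
    return math.sqrt(score_3gram(sent, i, w_new, lm_prob) *
                     sim_hownet(w_new, sent[i]))
```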

In the word alignment phase, we keep only the aligned pair with the highest probability for each word, so the aligned bilingual corpus contains only one-to-one or one-to-null mappings. Therefore, when replacing aligned word pairs that contain unknown words, we need to handle three cases:

  • Both the source side and the target side are unknown words: in this case, we must consider bilingual word similarity; that is, only a translation pair that is similar to the original pair on both the source side and the target side will be selected. For an aligned word pair \( (s_i, t_j) \), the score for replacing it with an alternative pair \( (s_i^{\prime}, t_j^{\prime}) \) is calculated as formula 4:

    $$ Score = \frac{Sim(s_i, s_i^{\prime}) + Sim(t_j, t_j^{\prime})}{2} $$
    (4)
  • One side is an unknown word and the other side is an in-vocabulary word: in this case, we replace only the unknown word \( s_i \), but we should still consider bilingual word similarity: a word is selected only if it is similar to the original word and its translation is similar to the aligned word. We therefore first obtain the translation \( T_{s_i^{\prime}} \) of the alternative replacement word \( s_i^{\prime} \) from the bilingual dictionary, and calculate the replacement score with formula 5:

    $$ Score = \frac{Sim(s_i, s_i^{\prime}) + Sim(t_j, T_{s_i^{\prime}})}{2} $$
    (5)
  • One side is an unknown word and the other side is null: in this case, we consider only monolingual word similarity, so we simply take the similarity between the unknown word \( w_i \) and its alternative replacement word \( w_i^{\prime} \) as the replacement score:

    $$ Score = Sim(w_i, w_i^{\prime}) $$
    (6)

Finally, the in-vocabulary word or word pair with the highest replacement score is chosen to replace the unknown words.
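The three cases can be folded into one scoring routine. The sketch below uses `None` to mark "no replacement needed" or a null alignment; the callables `sim` and `trans_of` (dictionary lookup) are hypothetical stand-ins:

```python
def replacement_score(s, t, s_new, t_new, sim, trans_of):
    """Score a candidate replacement for an aligned pair (formulas 4-6).
    s_new / t_new are None when that side needs no replacement;
    t is None when the source word is aligned to null."""
    if s_new is not None and t_new is not None:
        # both sides unknown: average of the two monolingual similarities
        return (sim(s, s_new) + sim(t, t_new)) / 2.0
    if s_new is not None and t is not None:
        # source unknown, target in-vocabulary: compare the aligned target
        # word with the dictionary translation of the candidate
        return (sim(s, s_new) + sim(t, trans_of(s_new))) / 2.0
    # unknown word aligned to null: monolingual similarity only
    return sim(s, s_new)
```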

2.3 Restore Unknown Words

The NMT model is a sequence-to-sequence model, so we can only find the most likely alignment through the attention network. However, the attention network in the NMT model is very unstable. To reduce the effect of alignment errors, a judging step is added: we align the words in the training corpus with GIZA++ [7] to obtain a bilingual dictionary, which contains all words in the training corpus and their translations. For a word \( t_i \) in the output sentence, if \( t_i \) aligns to a replaced word \( s_j \), the bilingual dictionary is used to judge the correctness of the alignment: if the word pair \( (s_j, t_i) \) is in the bilingual dictionary, the alignment is considered correct, and \( t_i \) is replaced with the translation of the original source word; otherwise \( t_i \) is kept in the output sentence.
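A sketch of this restoration step. Here `replaced` records, for each replaced source position, the original word and the in-vocabulary word standing in for it; `dict_pairs` is the GIZA++-derived dictionary viewed as a set of (source, target) pairs. All names are ours:

```python
def restore(output, alignments, replaced, bilingual_dict, dict_pairs):
    """Restore translations of replaced source words in the NMT output.

    output:     target words produced by the NMT model
    alignments: alignments[i] = source position that target word i
                most strongly attends to
    replaced:   {src_pos: (original_word, replacement_word)}
    """
    restored = list(output)
    for i, t in enumerate(output):
        j = alignments[i]
        if j in replaced:
            original, replacement = replaced[j]
            # the alignment is judged correct only if the dictionary
            # confirms that the replacement word translates to t
            if (replacement, t) in dict_pairs:
                restored[i] = bilingual_dict[original]
    return restored
```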

3 Experiments

Since HowNet is a Chinese and English semantic dictionary, we verify our method on the English to Chinese translation task.

3.1 Settings

The bilingual data used to train the NMT model is selected from the CWMT2015 English-Chinese news corpus and contains 1.6 million sentence pairs. The development set and test set are officially provided by CWMT2015, each with 1,000 sentences. To shorten the training time, sentence pairs longer than 50 words on either the source side or the target side are filtered out. Word alignment is also carried out on the training set. The language models and word vectors are trained on monolingual data containing 5 million sentences selected from the CWMT2015 English-Chinese news corpus, for both the source language and the target language.

We use the BLEU score [8] to evaluate the translation results.

3.2 Training Details

The hyper-parameters of our NMT system are as follows: the vocabulary size is limited to 20k on the source side and 30k on the target side; the number of hidden units is 512 for both the encoder and the decoder; the word embedding dimension of the source and target words is 512. The parameters are updated with the Adadelta algorithm [9]. Dropout [10] is used at the readout layer, with the dropout rate set to 0.5.
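For reference, the settings above collected in one place (the key names are ours, not those of any particular toolkit):

```python
CONFIG = {
    "src_vocab_size": 20_000,
    "tgt_vocab_size": 30_000,
    "hidden_units":   512,         # encoder and decoder
    "embedding_dim":  512,         # source and target word embeddings
    "optimizer":      "adadelta",  # parameter updates [9]
    "dropout":        0.5,         # applied at the readout layer [10]
    "max_sent_len":   50,          # longer pairs filtered out (Sect. 3.1)
}
```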

3.3 Comparative Experiments and Main Results

There are five systems in our comparative experiments:

  1. Moses [11]: an open-source phrase-based SMT system with the default configuration.

  2. RNNSearch: our baseline NMT system with an improved attention mechanism [12].

  3. PosUnk: the baseline NMT system extended with the unknown words processing method proposed by Luong et al. [13].

  4. w2v&lm_restore: the baseline NMT system extended with the method of Li et al. [5], which replaces unknown words based on word vectors and language models. The word vectors are trained with the word2vec toolkit [14], and the 3-gram language models with modified Kneser-Ney smoothing are trained with SRILM [15].

  5. hn&lm_restore: the baseline NMT system extended with our method, which replaces unknown words using HowNet and the language models. The language models are the same as those used in system 4.

The main experimental results are shown in Table 1.

Table 1. BLEU scores (%) of different systems

As we can see, our system (hn&lm_restore) performs poorly on this data: it slightly improves over the baseline NMT system, but it is worse than the other unknown words processing methods. The reason is that more than two-thirds of the unknown words in this data are not covered by HowNet, and such unknown words cannot be replaced by our method. To evaluate our method under better HowNet coverage, we select another data set from the CWMT2015 English-Chinese news corpus in which most of the unknown words are covered by HowNet; we refer to this data as the HowNet adapted data. It includes a training set of 1 million sentence pairs, a development set of 1,000 sentences, and a test set of 1,000 sentences. The experimental results on the HowNet adapted data are shown in Table 2.

Table 2. BLEU scores (%) of different systems on HowNet adapted data

On the HowNet adapted data, our system (hn&lm_restore) outperforms the baseline system (RNNSearch) by 2.89 BLEU points on average. In addition, it surpasses the NMT system with a simple unknown-word processing module (PosUnk) by 1.25 BLEU points, and it significantly improves over the traditional method (w2v&lm_restore) by 0.7 BLEU points.

These results show that our method is effective on the HowNet adapted corpus. As HowNet continues to expand and improve, our approach will become useful on more corpora.

3.4 Comparison of Translating Details

Here we compare the translation details of our system with those of the other systems, focusing on how unknown words are translated. Translation examples are shown in Table 3.

Table 3. Translation instances table

The main advantage of our system is that the replacement words it selects are more appropriate. In eg1 and eg2, the unknown words are a morphologically derived word (amazingly) and a compound word (never-ending). These unknown words break the semantic continuity of the source sentences. Worse, they are rare words, so their word vectors are not well trained. As a result, traditional replacement methods change the original meaning of the source sentences and affect the subsequent translations, resulting in over-translation or disfluent translation.

However, these rare words are covered by HowNet. Our method finds more appropriate replacement words, better preserves the original meaning of the source sentences, and has less impact on the subsequent translations. After restoration, we obtain translations that are very close to the references.

Although our method can handle most unknown words, some remain unsolved. In eg3, the number 110,000 is not contained in HowNet, so our method cannot deal with items of this kind; for such cases, we can only replace the unknown words in post-processing.

4 Conclusion and Future Work

This paper proposes an unknown words processing method for NMT that integrates the concepts and sememes of HowNet with language models. The method has advantages in dealing with rare words and polysemous words: it not only improves the translation of unknown words in NMT, but also preserves the semantic completeness of the original sentence. Experiments on English-to-Chinese translation show that our method achieves a significant improvement over the baseline NMT system and also outperforms traditional unknown words processing methods.

Our future work has two aspects. First, our method relies on the coverage of HowNet over the corpus, and improving this coverage is left as future work. Second, the replacement method proposed in this paper is limited to the word level; we plan to extend it to the phrase level.