
Enhancing Lexical Translation Consistency for Document-Level Neural Machine Translation

Published: 13 December 2021


Abstract

Document-level neural machine translation (DocNMT) has yielded attractive improvements. In this article, we systematically analyze the discourse phenomena in Chinese-to-English translation, and focus on the most obvious one, namely lexical translation consistency. To alleviate lexical inconsistency, we propose an effective approach that is aware of the words which need to be translated consistently and constrains the model to produce more consistent translations. Specifically, we first introduce a global context extractor to extract the document context and consistency context, respectively. Then, the two types of global context are integrated into an encoder enhancer and a decoder enhancer to improve lexical translation consistency. We create a test set to evaluate lexical consistency automatically. Experiments demonstrate that our approach can significantly alleviate lexical translation inconsistency. In addition, our approach can also substantially improve translation quality compared to the sentence-level Transformer.


1 INTRODUCTION

During the last few years, neural machine translation (NMT) has achieved remarkable progress and become the de facto standard paradigm of machine translation. A variety of effective NMT methods built on the encoder-decoder framework have been proposed to improve sentence-level translation quality thanks to powerful end-to-end modeling [1, 4, 20, 36, 52, 54]. However, when fed an entire document, standard NMT systems have to translate sentences in isolation without considering cross-sentence dependencies. Consequently, document-level neural machine translation (DocNMT) methods have been explored to utilize inter-sentence contextual information and improve performance across the sentences of a document [11, 12, 14, 41].

When translating a document, NMT systems have to handle discourse phenomena between sentences to generate more coherent translations. This is a new challenge that sentence-level translation does not need to face. It is widely recognized that different languages and genres present different discourse phenomena. Therefore, we carefully analyze the performance of NMT on discourse phenomena in three genres of Chinese-to-English translation tasks in Section 2, which, to the best of our knowledge, has not been systematically studied. We find that the most obvious phenomenon in Chinese-to-English translation is lexical translation consistency, which means that repeated source words tend to share the same target translations in the document [44]. Table 1 shows an inconsistent example containing two sentences in Chinese-to-English translation. A sentence-level NMT (SentNmt [36]) system translates the same named entity “证监会” in the two sentences into “CSRC” and “commission”, respectively. And the repeated notional word “准则 (Code)” is translated into two different words, “guidelines” and “standards”, which seriously harms discourse cohesion.


Table 1. An Example of Inconsistent Translations Indicated with the Underlines

In statistical machine translation (SMT), various approaches have been proposed to encourage lexical translation consistency [7, 22, 28, 44]. However, it is seldom studied in NMT. Although existing DocNMT methods can alleviate translation inconsistency by introducing cross-sentence contextual information, the problem is still serious. The majority of DocNMT methods mainly focus on the design of novel neural networks so that cross-sentence contextual information from different sources can be leveraged effectively [19, 21, 24, 25, 26, 31, 34, 37, 39, 48, 49, 51]. However, all contextual words are utilized in the same pattern. More specifically, existing DocNMT methods do not distinguish between repeated words and other words. As a result, they are not sensitive enough to the translation of repeated words, which are usually regarded as the triggers for consistent translation [37, 44]. Recently, some researchers have begun to focus on the evaluation and resolution of discourse phenomena [2, 13, 38, 43]. Voita et al. [38] analyze the English-to-Russian subtitles dataset and propose a model that utilizes two-pass decoding to modify the results of sentence-level translation. The method improves the performance on four types of discourse phenomena including the translation consistency of entities. However, contextual words are still treated indiscriminately, and there are no explicit constraints on translation consistency. Meanwhile, the cases requiring consistent translations in Chinese-to-English tasks are more obvious and complex, not just limited to the translation of entities.

In this article, we propose an effective approach to enhance lexical translation consistency for document-level translation (Section 3). We aim to explicitly provide consistent information from the global context so that the model can pay attention to repeated words and generate more consistent translations. Specifically, the approach consists of two modules that are independent of the translation framework. First, a Global Context Extractor extracts two types of global context: the document context, and the consistency context built from all repeated source words. Then, Consistency Enhancers incorporate the global context into the outputs of the encoder and decoder layers to enhance lexical translation consistency.

We make the following contributions:

We analyze the discourse phenomena of Chinese-to-English document-level translation in different genres. And statistics show that lexical translation consistency is the most obvious phenomenon.

We explicitly model the lexical translation consistency for NMT. Repeated words and other words are treated differently, and two types of global context are utilized to provide a global and consistent constraint for each sentence. Independent of the translation network, our approach is easily adaptable to existing DocNMT models.

We create a test set to evaluate the lexical translation consistency automatically. Experiments show that our approach can effectively alleviate the lexical translation inconsistency and perform much better than existing DocNMT models. It can also significantly improve translation quality over the Transformer.


2 OBSERVATION

We conduct a human study on the discourse phenomena to clarify the problems in Chinese-to-English DocNMT. We compare the distribution of discourse phenomena in three different genres: news, talks, and subtitles. The language in news is formal, while the style of subtitles is considered to be colloquial. The style of talks is somewhere in between.

Specifically, we train a sentence-level Transformer NMT model [36].1 To analyze the performance of SentNmt, we randomly select 100 paragraphs for each genre.2 Annotators are simultaneously shown the source-side paragraphs and their translations generated by NMT. Then we ask them to read the translations sentence-by-sentence. If a translation is not appropriate, they are asked to refer to the reference and determine whether the errors can be resolved within the current sentence. We only focus on errors that have to be corrected with the aid of contextual sentences.

2.1 Types of Discourse Phenomena

The results of manual analysis are shown in Table 2. For news and talks, lexical and tense consistency account for a major proportion. News tends to report objective events, where repeated entities and facts are frequently mentioned and usually require consistent translations. Tense inconsistency mainly occurs in the translation of declarative sentences, for which it is difficult to distinguish past and present tense without the tenses of contextual sentences. In contrast, for colloquial subtitles, pronoun translation, ellipsis, and ambiguity are significantly more prominent. In Chinese, a pro-drop language, zero anaphora is widespread and confuses the choice of pronouns.

Type      | lexical consistency | tense consistency | pronoun translation | connective | ellipsis | ambiguity | other
News      | 43.9%               | 24.5%             | 9.2%                | 4.6%       | 6.9%     | 5.4%      | 5.5%
TED       | 35.4%               | 27.0%             | 8.7%                | 10.6%      | 7.3%     | 6.1%      | 4.9%
Subtitles | 21.9%               | 17.8%             | 19.6%               | 3.1%       | 16.5%    | 11.5%     | 9.6%

Table 2. The Proportion of Different Types of Discourse Phenomena in Chinese-to-English Translations

Although different genres have different distributions of discourse phenomena, it can be found that lexical inconsistency is always one of the most serious issues in Chinese-to-English translations.

2.2 Lexical Translation Consistency

We further analyze the cases of lexical translation consistency in the real human references, which guides us in designing methods to alleviate serious lexical inconsistency. We analyze the references of the paragraphs selected above for each genre.

2.2.1 Trigger of Translation Consistency.

Intuitively, repeated source words are more likely to be translated into the same target words. However, one possible issue is that some non-repeated source words (usually pronouns or coreferential expressions) may also need to be translated into the same target words. Fortunately, we find that such cases are rare, accounting for less than 15.2% of all cases of target-side consistency. Therefore, we focus on the majority of cases where repeated target words are translated from repeated source words. We regard the repeated source words as the triggers of translation consistency, which can be obtained easily by character matching.

In addition, for repeated phrases consisting of multiple words, considering the word-by-word translation paradigm of NMT, we treat the multiple words in one phrase as separate words. As long as one of the words is translated differently, the target phrase is inconsistent.
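As an illustration, the sketch below collects repeated-noun triggers in a document by exact character matching. The data structures (POS-tagged sentences) and the stop-word list are assumptions for the example, not taken from the article:

```python
from collections import defaultdict

STOP_WORDS = {"的", "了", "是"}  # placeholder stop-word list; the article's list is not specified

def repeated_noun_groups(doc_tagged_sents):
    """Group repeated nouns in a document by exact character matching.

    doc_tagged_sents: list of sentences, each a list of (word, pos) pairs from a
    POS tagger; nouns are assumed to carry tags starting with 'N'.
    Returns {word: [(sent_idx, tok_idx), ...]} for words occurring at two or more positions.
    """
    positions = defaultdict(list)
    for i, sent in enumerate(doc_tagged_sents):
        for j, (word, pos) in enumerate(sent):
            if pos.startswith("N") and word not in STOP_WORDS:
                positions[word].append((i, j))
    return {w: occ for w, occ in positions.items() if len(occ) >= 2}
```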

2.2.2 Types of Consistency.

Inspired by Guillou [9], we discuss the translation consistency of repeated source words with three different part-of-speech tags: nouns, verbs, and adjectives. Figure 1 Left shows the proportion of consistent translations (i.e., the number of words repeated at both the source and target side divided by the number of words repeated at the source side). It can be found that repeated nouns are more likely to be translated consistently than verbs and adjectives. The proportions of repeated verbs and adjectives translated consistently are only around half. Their translations are so flexible that it is difficult to determine the consistency. In contrast, more than 75% of repeated nouns are translated consistently. In news, a formal genre, the proportion is even higher (about 87.5%). Inconsistent instances are usually due to the fact that some repeated source-side nouns are translated into coreferential words with different forms.


Fig. 1. The types of lexical translation consistency. Left: the proportion of consistent translations in repeated source words with different part-of-speech tags. Right: the proportion of different types in repeated nouns translated consistently.

Among the nouns translated consistently, named entities are common. As shown in Figure 1 Right, consistent named entities account for about 46.4% in news. One reason for the large proportion of entities in subtitles is that the names of characters in movies are frequently mentioned. Besides, there are many consistent general nouns related to the topic (which may be important for constructing lexical chains). Such cases account for more than 50% in TED Talks.

In conclusion, lexical translation consistency is the most significant phenomenon in Chinese-to-English document-level translation. Most repeated nouns are translated into the same target words, while verbs and adjectives are not. This inspires us to focus on the consistent translation of nouns.


3 APPROACH

To alleviate the lexical translation inconsistency, our approach consists of two modules. Global Context Extractor extracts two types of global context (Section 3.1). The document context vectors are obtained by special tokens, and the consistency context vectors are generated from individual encoder states of repeated words3 in the document. Consistency Enhancers utilize the extracted global vectors to enhance the translation consistency at the outputs of encoder and decoder blocks (Section 3.2). In the decoder enhancer, we learn a consistency classifier to determine whether a repeated source word should be translated consistently. Our approach translates sentences in parallel. The overall architecture is shown in Figure 2.


Fig. 2. Overview of our approach. Sentences in a document are translated in parallel. The global document context vectors and consistency context vectors are obtained by the Global Context Extractor (whose details are shown in the left dotted box) on top of the encoder blocks. Then, the extracted context is utilized to modify the encoder states by the Encoder Enhancer, and the prediction probability distribution by the Decoder Enhancer (whose details are shown in the right dotted box), respectively.

3.1 Global Context Extractor

We extract two types of global context: document context and consistency context.

Suppose that there are I sentences and N different repeated words in a document. The document context is a set of I context-aware vectors, which provide document-level contextual information for each sentence to improve translation quality. The generation of each document context vector does not distinguish whether words are repeated or not. Inspired by popular pre-trained language models such as GPT [29] and BERT [5], we add a special symbol at the beginning of each sentence. We believe the symbol can encode its sentence-level information well through the self-attention mechanism. After encoding, the hidden state of the special symbol in each sentence is extracted as the input to a Transformer layer (containing a multi-head self-attention sub-layer and a feed-forward network sub-layer) to model the dependencies among sentences in the document. Therefore, for each sentence, we obtain a corresponding document context vector of dimension d, where d indicates the hidden size.

The consistency context is a set of N global consistency-aware vectors. For each repeated word, the extractor collects all individual encoder states belonging to the word from all sentences to generate a global consistency context vector. Specifically, each token in a sentence is encoded into an individual state. If a token belongs to the nth repeated word, the extractor adds its state to the corresponding set that stores all the states of tokens belonging to the nth repeated word in the entire document. Then, we generate a unique global consistency context vector for the nth repeated word as follows:

\[ c_n = \mathrm{MaxPooling}\big(\{\, h_{i,j} \mid x_{i,j} \in \mathcal{G}_n \,\}\big) \tag{1} \]
where h_{i,j} denotes the encoder state of the jth token x_{i,j} in the ith sentence, \mathcal{G}_n denotes the set of tokens belonging to the nth repeated word in the entire document, and the element-wise max-pooling operation takes all of their states as inputs and outputs a single vector c_n. The input size is variable.
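A minimal PyTorch-style sketch of the extractor is given below; the module, argument names, and tensor layout are our own illustration (not the authors' released code), assuming the special symbol occupies position 0 of every sentence:

```python
import torch
import torch.nn as nn

class GlobalContextExtractor(nn.Module):
    """Sketch of the global context extractor (illustrative names and shapes).

    enc_states: [I, L, d]  encoder states of the I sentences of one document,
                           where position 0 of every sentence holds the special symbol.
    groups:     [I, L]     repeated-word group id per token, -1 for non-repeated tokens.
    """

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # one Transformer layer over the special-symbol states models
        # the dependencies among sentences in the document
        self.doc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, batch_first=True)

    def forward(self, enc_states, groups, num_groups):
        # document context: one vector per sentence, from the special-symbol states
        cls_states = enc_states[:, 0, :].unsqueeze(0)     # [1, I, d]
        doc_ctx = self.doc_layer(cls_states).squeeze(0)   # [I, d]

        # consistency context: element-wise max-pooling over all states of
        # tokens that belong to the same repeated word (variable input size)
        cons_ctx = enc_states.new_zeros(num_groups, enc_states.size(-1))
        for n in range(num_groups):
            mask = groups.eq(n)                           # occurrences of the nth repeated word
            if mask.any():
                cons_ctx[n] = enc_states[mask].max(dim=0).values
        return doc_ctx, cons_ctx
```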

3.2 Consistency Enhancer

We integrate the extracted global context vectors into the standard NMT model for two purposes. First, the repeated source words should know the information of each other during encoding. Second, the decoding probability distributions should be encouraged to be similar when translating the same repeated word. Therefore, we design two types of consistency enhancers that integrate the global context vectors to modify the encoder states on the encoder side and the prediction probability distribution on the decoder side, respectively.

3.2.1 Encoder Enhancer.

The encoder enhancer integrates the document context and the consistency context into the encoder states of source words. We place our encoder enhancer on top of the standard sentence-level encoder blocks of the Transformer framework. For a word x_{i,j} with encoder state h_{i,j}, there are two cases:

(1) If x_{i,j} is a repeated word, we denote the group number of the repeated word to which x_{i,j} belongs by n, where 1 ≤ n ≤ N. Then, h_{i,j} is integrated with the corresponding consistency context vector c_n via a gated sum operation as follows:

\[ \lambda_{i,j} = \sigma\big(\mathrm{FFN}([h_{i,j}; c_n])\big) \tag{2} \]
\[ \tilde{h}_{i,j} = \lambda_{i,j} \odot h_{i,j} + (1 - \lambda_{i,j}) \odot c_n \tag{3} \]
where FFN denotes a feed forward network, \sigma stands for the sigmoid function, and [\cdot;\cdot] concatenates elements into a vector. The gate weight \lambda_{i,j} balances the sentence-level individual encoder state and the document-level shared consistency context.

(2) If x_{i,j} is a non-repeated word, its encoder state is integrated with the corresponding document context vector instead of the consistency context, using the same network defined by Equation (2) and Equation (3).

The special symbol directly copies the corresponding document context vector as its final encoder state.
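The two cases above can be sketched in PyTorch as follows. This is a sketch under the gated-sum reading of Equations (2)-(3); the module and argument names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EncoderEnhancer(nn.Module):
    """Sketch of the gated integration of Equations (2)-(3); names are illustrative."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.gate_ffn = nn.Sequential(          # FFN over the concatenated vectors
            nn.Linear(2 * d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, enc_states, doc_ctx, cons_ctx, groups):
        """enc_states: [I, L, d]; doc_ctx: [I, d]; cons_ctx: [N, d];
        groups: [I, L] repeated-word group ids, -1 for non-repeated tokens."""
        # context per token: consistency context for repeated words,
        # the sentence's document context for all other words
        ctx = doc_ctx.unsqueeze(1).expand_as(enc_states).clone()   # [I, L, d]
        rep = groups.ge(0)
        ctx[rep] = cons_ctx[groups[rep]]

        gate = torch.sigmoid(self.gate_ffn(torch.cat([enc_states, ctx], dim=-1)))
        enhanced = gate * enc_states + (1.0 - gate) * ctx
        # the special symbol at position 0 keeps the document context as its final state
        enhanced[:, 0, :] = doc_ctx
        return enhanced
```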

3.2.2 Decoder Enhancer.

The decoder enhancer aims to make the probability distributions of the translations more similar when translating words that should be translated consistently. However, different from the encoder enhancer, which has been informed of the repeated words in source sentences, the decoder enhancer faces two issues. (1) Not all repeated source words need to be translated consistently. (2) When decoding a target word at time t, it is unknown whether this word is translated from a repeated source word, or which repeated word is being translated.

Therefore, we leverage a consistency classifier and encoder-decoder attention weights to alleviate the two issues, respectively. Specifically, the final prediction probability distribution is computed in three steps. It is noted that the calculation of the first two steps is independent of the current decoding state, so they can be immediately performed after the global document context and consistency context are extracted.

Step 1. For the nth repeated word, we generate a consistency probability distribution over the target vocabulary using the consistency context vector c_n as follows:

\[ q_n = \mathrm{softmax}\big(W (M c_n) + b\big) \tag{4} \]
where W and b are learnable parameters that we share with the output layer of the original NMT model, and M is a transfer matrix. The distribution q_n is supposed to constrain the original translation probability distribution at decoding time t.

Step 2. We use a consistency classifier to estimate a consistency probability as the confidence of a repeated source word being translated consistently. We define it as a binary classification task. Therefore, for a word x_{i,j}, there are two cases:

(1) If x_{i,j} is a repeated word whose group number is n, its confidence p_n of being translated consistently is calculated by a two-layer perceptron as follows:

\[ p_n = \sigma\big(W_2\, f(W_1 c_n + b_1) + b_2\big) \tag{5} \]
where W_1, b_1, W_2, and b_2 are model parameters and f is the non-linear activation of the perceptron.

(2) If x_{i,j} is a non-repeated word, the confidence is defined as 0, so that it imposes no consistency constraint.

Step 3. We calculate the final prediction probability distribution with the aid of the encoder-decoder attention weights that bridge the current target word and the source words in the sentence. We average the encoder-decoder multi-head attention over heads and layers. Then the averaged attention weights are fed into a softmax function to output the normalized attention weight vector. We denote its jth element as \hat{\alpha}_{t,j}, which measures the contribution of the jth source word to the generation of the target word at time t.

The final probability distribution at time t is calculated by:

\[ P_t = \Big(1 - \sum_{j} \hat{\alpha}_{t,j}\, p_{j}\Big)\, P_t^{\mathrm{nmt}} + \sum_{j} \hat{\alpha}_{t,j}\, p_{j}\, q_{n(j)} \tag{6} \]
where P_t^{\mathrm{nmt}} is the original translation probability distribution of the NMT model, p_j is the consistency confidence of the jth source word (0 for non-repeated words), and q_{n(j)} is the consistency distribution of the repeated-word group to which the jth source word belongs.

The following example explains the calculation of the final probability distribution, and Figure 3 shows the steps of the example.


Fig. 3. The calculation steps of the example used to explain the decoder enhancer.

Example. Suppose a document has two different repeated words. Therefore, we obtain two consistency distributions at step 1. Consider a source sentence with five words, where two of the words belong to the first repeated word, one belongs to the second, and the remaining two are non-repeated. At step 2, we obtain the consistency confidence of each word. For time t, the averaged encoder-decoder attention weights and the original probability distribution are then combined as in step 3 to obtain the final distribution.
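Since the concrete numbers of this example are not reproduced here, the toy values below are invented purely for illustration; the mixing follows one reading of the step-3 combination described above:

```python
import torch

# Invented toy values: a vocabulary of 4 types, two repeated-word groups,
# and a source sentence of 5 tokens (none of these numbers come from the article).
q = torch.tensor([[0.7, 0.1, 0.1, 0.1],     # consistency distribution of group 1 (step 1)
                  [0.1, 0.1, 0.7, 0.1]])    # consistency distribution of group 2
p = torch.tensor([0.9, 0.9, 0.6, 0.0, 0.0])        # per-token consistency confidence (step 2)
group = torch.tensor([0, 0, 1, -1, -1])             # group id per source token, -1 = non-repeated
alpha = torch.tensor([0.5, 0.2, 0.1, 0.1, 0.1])     # normalized enc-dec attention at time t
p_nmt = torch.tensor([0.3, 0.4, 0.2, 0.1])          # original NMT distribution at time t

# Step 3: mix the NMT distribution with the consistency distributions,
# weighted by attention weight times consistency confidence.
weights = alpha * p                                  # zero for non-repeated tokens
mix = sum(w * q[g] for w, g in zip(weights, group) if g >= 0)
p_final = (1.0 - weights.sum()) * p_nmt + mix
print(p_final)   # still sums to 1 and is pulled toward group 1's distribution
```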

3.3 Training

Our approach translates the sentences in a document in parallel, and the consistency context is extracted over the document. Therefore, when training, we shuffle the data over documents. Our model is trained in two stages, which has proved effective [25, 26, 34]. First, a standard sentence-level Transformer is pre-trained to ensure a good initialization of the encoder-decoder attention weights. Then, we add our global context extractor and consistency enhancers and optimize the parameters. The newly introduced consistency classifier is trained together with the original translation components. Suppose a document has I sentences and N different repeated words; our optimization goal is to minimize the following negative log-likelihood loss:

\[ \mathcal{L} = -\sum_{i=1}^{I} \sum_{t} \log P(y_{i,t} \mid y_{i,<t}, \mathbf{x}_i) - \sum_{n=1}^{N} \sum_{m=1}^{M_n} \big[\, \ell_{n,m} \log p_n + (1-\ell_{n,m}) \log(1-p_n) \,\big] \tag{7} \]
where \mathbf{x}_i is the ith source sentence, y_{i,t} is the tth target word of the ith sentence, and \ell_{n,m} is the golden label of the mth word belonging to the nth group of repeated words, whose size is M_n. If its translation is consistent, \ell_{n,m} = 1; otherwise, \ell_{n,m} = 0. Section 5.1.2 describes the automatic annotation process of \ell_{n,m}.
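A compact sketch of such a joint objective (translation negative log-likelihood plus a binary cross-entropy term for the consistency classifier; the function name, argument layout, and equal weighting of the two terms are our assumptions) is:

```python
import torch
import torch.nn.functional as F

def joint_loss(log_probs, targets, cons_confidence, cons_labels, pad_id=0):
    """Sketch of the two-part training objective (illustrative names).

    log_probs:       [num_tokens, vocab] log-probabilities of target tokens
    targets:         [num_tokens] gold target token ids (pad_id is ignored)
    cons_confidence: [num_labeled] predicted confidence p for labeled repeated words
    cons_labels:     [num_labeled] gold labels (1 = translated consistently, 0 = not)
    """
    nll = F.nll_loss(log_probs, targets, ignore_index=pad_id, reduction="sum")
    cls = F.binary_cross_entropy(cons_confidence, cons_labels.float(), reduction="sum")
    return nll + cls
```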


4 TEST SET

It is generally acknowledged that standard machine translation metrics (e.g., BLEU) are not sensitive enough to discourse phenomena [42]. Recently, some works have created contrastive test sets to evaluate specific phenomena [2, 13, 38]. Each test instance consists of a positive translation and several negative translations with incorrect phenomena. Models are evaluated by the proportion of instances for which the generation probability of the positive translation is higher than that of the negative ones. Voita et al. [38] construct contrastive test sets for English-to-Russian subtitles to evaluate four types of discourse phenomena including the translation consistency of named entities (which they call lexical cohesion in their article). However, the hand-crafted test sets may not carry over to practical scenarios [17]. Their target-side context is already given, so the test sets cannot evaluate the quality of generated context sentences. Meanwhile, they cannot evaluate the real translation results. In practice, the generation of subsequent sentences is affected by previously generated words.

As a result, we construct a test set from real data to evaluate the lexical consistency of practical translations. Table 3 shows a test instance. Each test instance is a paragraph pair containing several consistent instances annotated manually. The annotation process contains four steps: (1) We collect triggers of translation consistency, i.e., repeated source-side nouns. (2) We extract triggers translated consistently and record the corresponding sentence indexes. For a repeated source word, we check its target-side words. If the lemmas of the corresponding target words are the same, it is a consistent instance.4 (3) We annotate the triggers’ types: named entity or general words. (4) We expand the possible translations of the repeated source word. We extract the top-20 candidate translations of the repeated source word from the lexical table of Moses [18], and keep the lemmas of candidates with the correct meaning. We create a test set that contains 150 paragraphs for each of the domains: News, TED Talks, and Subtitles. Table 4 shows the statistics of consistent instances.


Table 3. An Instance of Test Set Evaluating the Lexical Translation Consistency

  • and inconsistent ones are marked.

Genre     | #Para. | CI Total | CI Entity | CI General | Para Avg. CI | CI Avg. len
News      | 150    | 537      | 257       | 280        | 3.28         | 2.67
TED       | 150    | 317      | 112       | 205        | 2.11         | 2.84
Subtitles | 150    | 188      | 95        | 93         | 1.25         | 2.13

Table 4. Statistics of Consistent Instances in the Test Set for Lexical Translation Consistency

When evaluating a result, we perform a simple automatic matching. For each consistent instance in the tested paragraph, we lemmatize the generated sentences with the NLTK toolkit. We check the generated sentences in the index list one-by-one, and extract the lemmas that belong to the candidate list. The number of lemmas extracted from each sentence must be equal to the number of times the sentence is indexed. For example, for the consistent instance triggered by “意识” in Table 3, “consciousness/mentality” has to appear twice in the translation of “<S7>”. After that, if the extracted lemmas are exactly the same, we consider this instance to be translated consistently. The accuracy over consistent instances is used to evaluate the ability of models to resolve inconsistent translation.
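For concreteness, a sketch of this matching procedure (using NLTK's WordNet lemmatizer; the function and argument names are illustrative, not the article's evaluation script) might be:

```python
from collections import Counter
from nltk.stem import WordNetLemmatizer   # requires the NLTK 'wordnet' data package

lemmatizer = WordNetLemmatizer()

def instance_is_consistent(hyp_sentences, sent_indexes, candidate_lemmas):
    """Automatic check for one consistent instance (illustrative sketch).

    hyp_sentences:    translated sentences of the paragraph, each a list of tokens
    sent_indexes:     indexes of sentences that must realize the trigger's translation;
                      an index appears k times if the trigger occurs k times in it
    candidate_lemmas: annotated candidate translation lemmas of the trigger
    """
    needed = Counter(sent_indexes)
    extracted = []
    for idx, times in needed.items():
        lemmas = [lemmatizer.lemmatize(tok.lower()) for tok in hyp_sentences[idx]]
        hits = [l for l in lemmas if l in candidate_lemmas]
        if len(hits) != times:        # each indexed occurrence must be realized exactly
            return False
        extracted.extend(hits)
    # consistent only if all extracted lemmas are exactly the same
    return len(set(extracted)) == 1
```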


5 EXPERIMENTS

5.1 Data Preparation

5.1.1 Datasets.

For Chinese-to-English (ZhEn) translation, we evaluated our approach on three different genres. For the news genre, we used News-Commentary v14 provided by WMT195 for training. newstest2017 and newstest2018 were used for development and testing, respectively. For talks, we used TED Talks from IWSLT17.6 We used dev-2010 as the development set and tst-2010-2013 as the test set, as Miculicich et al. [26] do. For subtitles, the dataset was collected by Wang et al. [40] from the subtitles of television episodes.7 We removed the sentences used to analyze discourse phenomena from the original training set.

For English-to-German (EnDe) translation, we conducted experiments on the same datasets as Maruf et al. [25].8 Specifically, there were three datasets. TED Talks was also from IWSLT17. We took tst-2016-2017 as the test set and the rest as the development set. For the news genre, the News-Commentary v11 corpus was used for training. The newstest2015 and newstest2016 sets in WMT were used as the development set and test set, respectively. Europarl was extracted from Europarl v7 and split by the SPEAKER tag.

The corpora statistics are listed in Table 5. All the above datasets provide document boundaries. Considering the memory limitation, original documents with more than 16 sentences were split into paragraphs, and we treated each paragraph as one document in our experiments.

Dataset     | ZhEn News | ZhEn TED | ZhEn Subtitles | EnDe News | EnDe TED | EnDe Europarl
Training    | 0.31M     | 0.23M    | 2.14M          | 0.24M     | 0.21M    | 1.67M
Development | 2.00K     | 0.88K    | 1.09K          | 2.17K     | 8.97K    | 3.59K
Test        | 3.98K     | 6.05K    | 1.15K          | 3.00K     | 2.27K    | 5.13K
  • M: million. K: thousand.

Table 5. Statistics of the #sentence in Different Datasets


Chinese sentences were segmented into words by our in-house toolkit. English and German datasets were tokenized by the Moses toolkit.9 Focusing on the consistency of nouns, we ran the Stanford part-of-speech tagger [23] on the source sentences and removed stop words to extract the repeated nouns. Words were segmented by byte-pair encoding with 30K merge operations [30]. Note that only the sub-words segmented from the repeated words are regarded as repeated words, and the same sub-words of different words are not shared. For example, the sub-word “No” in “Nosair” and the sub-word “No” in “Nordlund” belong to different repeated groups.
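A small sketch of this bookkeeping, assuming the common subword-nmt convention of marking non-final BPE pieces with “@@” (the exact marker used in the article is not stated), could be:

```python
def assign_subword_groups(bpe_sentences, word_groups):
    """Propagate repeated-word group ids to BPE sub-words (illustrative sketch).

    bpe_sentences: sentences as lists of sub-words; non-final pieces end with '@@'
    word_groups:   sentences as lists of group ids per original word (-1 = non-repeated)
    Sub-words inherit the group id of the word they come from, so the same piece
    coming from two different repeated words keeps two different group ids.
    """
    groups = []
    for pieces, wgroups in zip(bpe_sentences, word_groups):
        sent_groups, w = [], 0
        for piece in pieces:
            sent_groups.append(wgroups[w])
            if not piece.endswith("@@"):   # last piece of the current word
                w += 1
        groups.append(sent_groups)
    return groups
```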

5.1.2 Annotation of Lexical Consistency.

To train the consistency classifier, we automatically annotated the repeated source words translated consistently using the alignment tool [6]. Specifically, we removed stop words and ran the alignment tool on the source-reference pairs. A repeated source word is assumed to be consistent if the frequency of repetition in its aligned target words is equal to the frequency of repetition in the source.10 In this way, we can annotate the repeated words that are translated consistently.
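A sketch of this labeling rule (assuming fast_align-style 0-based “i-j” alignment pairs and pre-removed stop words; the same-stem matching of footnote 10 is omitted here for brevity) is:

```python
from collections import Counter

def is_translated_consistently(src_words, tgt_words, alignment, repeated_word):
    """Label one repeated source word as consistent or not (illustrative sketch).

    alignment: iterable of (i, j) pairs linking source position i to target position j.
    The word is labeled consistent if its most frequent aligned target word is
    repeated as many times as the source word itself.
    """
    src_count = sum(1 for w in src_words if w == repeated_word)
    aligned = Counter(tgt_words[j] for i, j in alignment if src_words[i] == repeated_word)
    if not aligned:
        return False
    _, tgt_count = aligned.most_common(1)[0]
    return tgt_count == src_count
```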

5.2 Baselines and Details

We compared our approach with the following methods.

SentNmt [36] is a standard Transformer model with the “base” version parameters.

Cache [34] is a model utilizing the translation history. The model reads the states of a fixed number of previously generated words stored in a cache. Then, the weighted state is used to modify the decoder state in the RNN framework. We re-implemented the method on the Transformer. The cache size was set to 25 words, as suggested in their article.

DocT [51] encodes previous context sentences through an extra encoder, and introduces contextual information into each encoder and decoder Transformer layer.

HAN [26] adds a hierarchical attention network on top of the last encoder and decoder layers to model sentence-level and word-level information in previous sentences. We adopted the “HAN encoder + HAN decoder” strategy that achieved the best performance.

SAN [25] calculates the weights of sentence-level and word-level context hierarchically. When calculating the attention weights, it utilizes the sparsemax function instead of softmax to focus on relevant sentences. We chose the “offline” model that uses the context of the entire document and integrates it into the encoder with the “sparse-soft H-Attention”. The encoder and decoder in their article have 4 layers, but we used 6 in our experiments for a fair comparison with other models.

MmcNmt [53] encodes each source sentence independently and integrates the source-side context on top of the encoder. It translates a document sentence-by-sentence with a Transformer-XL network to integrate the target-side history context.

X + Our stands for a model with our approach attached. Our global context extractor and consistency enhancers are independent of the translation model. Therefore, they can be added to existing DocNMT models.

We implemented all our models based on the toolkit THUMT [50].11 The parameters were the “base” version of the Transformer. Specifically, we used 6 layers of encoder and decoder with 8 attention heads. The hidden size h and feed-forward layer size were 512 and 2,048, respectively.

When training, we shuffled the data over paragraphs to ensure all sentences in a document were processed in parallel. The batch size was 3,000 tokens. We used the Adam optimizer. We employed label smoothing with a value of 0.1 and dropout with a rate of 0.1. During inference, we used multi-bleu.perl12 to compute the BLEU [27] score. The beam size was set to 4.


6 RESULTS

6.1 Translation Quality

We first study the impact of our approach on translation quality. Table 6 shows the average BLEU scores on the test sets.13 Since lexical consistency is not very sensitive to BLEU, our approach should at least ensure that it does not negatively affect translation quality.

Model        | ZhEn News | ZhEn TED | ZhEn Subtitles | EnDe News | EnDe TED | EnDe Europarl
SentNmt [36] | 13.17     | 16.97    | 29.10          | 22.78     | 23.28    | 28.72
Cache [34]   | 13.45     | 17.32    | 29.34          | 23.39     | 23.71    | 29.25
DocT [51]    | 13.71     | 17.75    | 29.82          | 23.08     | 24.00    | 29.35
HAN [26]     | 13.89     | 17.90    | 30.02          | 25.03     | 24.58    | 29.58
SAN [25]     | 13.84     | 17.69    | 30.06          | 24.76     | 24.23    | 29.72
MmcNmt [53]  | 14.06     | 18.64    | 30.11          | 24.91     | 25.10    | 30.40
Our          | 13.93     | 17.72    | 29.90          | 24.51     | 24.53    | 29.63
HAN + Our    | 14.16     | 18.15    | 30.29          | 25.11     | 24.89    | 29.96
SAN + Our    | 14.19     | 18.03    | 30.38          | 24.96     | 25.03    | 30.07
  • Our approach is always significantly better than SentNmt and Cache. Marked results are statistically significantly (p-values < 0.05) better than DocT, HAN, and SAN, respectively.

Table 6. Performance of Our Approach and Baselines on BLEU (%)


Results show that our approach is superior to SentNmt significantly, with +0.76, +0.75, and +0.80 BLEU gains for ZhEn News, TED, and Subtitles, respectively. For EnDe, our approach still improves the BLEU scores by +1.73, +1.25, and +0.91 on News, TED, and Europarl, respectively.

Compared with the existing DocNMT models HAN and SAN, our model achieves better or comparable translation quality. Although it is slightly lower than HAN and SAN on some datasets, the difference is not significant. Despite the higher BLEU achieved by MmcNmt, our goal-oriented approach achieves better translation consistency (shown in Table 7) and is easy to combine with existing models. When attached to HAN or SAN, our approach does not reduce, and even further improves, the BLEU of the original models.

Model        | BLEU  | News Entity | News General | TED Entity | TED General | Subtitles Entity | Subtitles General | Total Acc.
SentNmt      | 19.82 | 59.1        | 56.1         | 56.3       | 47.3        | 49.5             | 43.0              | 53.4
Cache [34]   | 20.24 | 62.3        | 60.4         | 60.7       | 52.7        | 52.6             | 49.5              | 57.7
DocT [51]    | 20.51 | 60.3        | 58.2         | 61.6       | 50.7        | 51.6             | 44.1              | 55.8
HAN [26]     | 20.69 | 58.0        | 58.6         | 59.8       | 51.2        | 52.6             | 48.4              | 55.8
SAN [25]     | 20.71 | 63.0        | 57.5         | 57.1       | 50.7        | 53.7             | 46.2              | 56.1
MmcNmt [53]  | 20.85 | 61.1        | 58.9         | 64.3       | 52.2        | 54.7             | 49.5              | 57.5
Our          | 20.87 | 68.5        | 63.6         | 70.5       | 59.0        | 61.1             | 52.7              | 63.4
  • Average BLEU scores (%) and accuracy (Acc.) of different genres of consistent instances (%) are reported.

Table 7. Performance of Our Approach and Baselines on the Test Set of Lexical Translation Consistency


Different from existing methods that utilize complex networks to capture the attention relationship between words in a long context sequence, our approach only uses a simple symbol to encode sentence-level contextual information. We suppose the improvement of BLEU by our model with global context mainly benefits from two points. First, the encoder states are enhanced by the global document contextual information. Second, the words that need to be consistent are translated more correctly, which directly affects the generation of subsequent sequences.

6.2 Translation Consistency

We then investigate the effectiveness of our approach to improve translation consistency on our specific lexical consistency test set.

As shown in Table 7, our approach performs best on both translation quality and consistency when compared with other methods. The BLEU scores are averaged over the three genres, and our approach is statistically significantly (p-values < 0.05) better than SentNmt, Cache, and DocT. The SentNmt model achieves only 53.4% total consistency accuracy. Although existing DocNMT models leverage the cross-sentence context to achieve better BLEU, the improvement in consistency accuracy is still limited. Among them, Cache [34] and MmcNmt [53] perform better than the other baselines, which is due to the utilization of translation history.

In contrast, our approach with the encoder and decoder enhancers explicitly models translation consistency and is more sensitive to repeated words. Our accuracy on lexical consistency is 63.4%, which is better than the sentence-level SentNmt model by 10.0%. Compared with other DocNMT models, the improvements of our approach in lexical consistency are also obvious. Our approach achieves the highest BLEU score on the test set. Meanwhile, its accuracy on lexical consistency is significantly higher than that of the other DocNMT methods.

In addition, we compare the performance of the methods on different types of consistent instances. For each genre, the translation consistency of general notional words is harder to achieve than that of named entities. Compared with general words, named entities obtain a larger accuracy improvement through our approach (+9.4% vs. +7.4% for News, +14.2% vs. +11.7% for TED, and +11.6% vs. +9.7% for Subtitles).

6.3 Effect of Consistency Enhancer

We compare the effect of the different enhancers, and discuss the contributions of the two types of global context. Experiments are conducted on the consistency test set. The BLEU is averaged over the three genres.

As shown in Table 8, the encoder enhancer and decoder enhancer behave differently. And interestingly, they seem to present a complementary relationship (row 2 vs. row 5). The encoder enhancer plays a more important role in the BLEU improvement, while the decoder enhancer is more helpful to enhance the consistency accuracy. The combination of two enhancers (row 6) is superior to single one in both translation quality and consistency.

# | Model                      | BLEU  | Accuracy
1 | SentNmt [36]               | 19.82 | 53.4
2 | Encoder Enhancer (EE)      | 20.59 | 56.5
3 | EE w/o consistency context | 20.48 | 55.3
4 | EE w/o document context    | 20.03 | 54.6
5 | Decoder Enhancer (DE)      | 20.12 | 59.1
6 | EE + DE                    | 20.87 | 63.4

Table 8. Results of Different Enhancers on Average BLEU and Total Consistency Accuracy

The encoder enhancer introduces both document context and consistency context. The consistency context enhances repeated source words, which can construct lexical chains throughout the document. However, due to the relative sparseness of repeated words, the improvement of using the consistency context alone (row 4) is limited. In contrast, the document context is integrated into most of the encoder states to enhance the source-side representation. During decoding, the special symbol, whose encoder state is the corresponding document context vector, can offer contextual information directly. As row 3 shows, the document context can utilize general global information effectively to improve translation quality, but it is not sensitive to consistency.

The decoder enhancer is specifically designed for consistency. Results show that it can well constrain the model to produce consistent translations (row 5 vs. row 1). However, without the document context, decoder enhancer cannot obtain cross-sentence contextual information except for repeated words.

6.4 Sentences with Repeated Words

Figure 4 shows the BLEU gains over the SentNmt model on sentences with different numbers of repeated words in the Chinese-to-English TED test set. Whenever a sentence contains repeated words, our approach achieves BLEU gains over SentNmt.


Fig. 4. BLEU gains over the SentNmt model (the black dotted lines) on the sentences with repeated words.

More importantly, as the number of repeated words in a sentence increases, the BLEU improvement of our approach (“Ours” in Figure 4) gets higher and higher. In contrast, the other three DocNMT models do not share this property. This demonstrates that our approach can effectively utilize global and consistent information. The more repeated words there are, the closer the connection between the current sentence and other sentences through the consistency context vectors, and the stronger the constraints the decoder enhancer can impose on the generation of sequences.

6.5 Results of Consistency Classification

Considering that not all repeated words are consistent on the target side, our approach introduces a consistency classifier to explicitly determine whether a repeated word needs to be translated consistently. Our test set has been annotated with consistent instances, so it can be used to measure the accuracy of the classification. Table 9 shows the results. Different from the accuracy of consistent instances in Table 7, the accuracy in Table 9 is calculated at the word level.

Model    | Total | News E | News G | News Total | TED E | TED G | TED Total | Subtitles E | Subtitles G | Subtitles Total
Majority | 79.5  | 90.0   | 82.0   | 85.6       | 83.8  | 69.2  | 72.3      | 87.2        | 68.6        | 76.6
Our      | 82.9  | 91.2   | 84.9   | 87.8       | 85.0  | 75.6  | 77.7      | 88.5        | 73.3        | 79.9
  • “Majority” means that all repeated words are forced to be translated consistently. E: entity, G: general.

Table 9. Accuracy of Predicting Whether Repeated Words Need to be Translated Consistently


“Majority” forces all repeated words to be translated the same. Its accuracy indicates that most repeated words are indeed translated consistently in the references, and that named entities tend to be translated more consistently than general nouns; the accuracy of entities is much higher than that of general words. Our classifier performs better than “Majority” by +3.4% total accuracy. The gain mainly comes from general words in the informal genres, i.e., TED and Subtitles.

Actually, the expression is flexible when translating a repeated word, which makes consistency prediction difficult. This is one reason why we adopt a soft strategy that uses the classification probability as a confidence to weight the distribution in Equation (6), rather than a hard strategy that specifies consistent words in advance.

6.6 Effect of Pre-training

Our approach applies the two-step training strategy, which has been widely used in DocNMT methods [25, 26, 31, 34, 51]. We discuss the effect of pre-training in the news genre, so that the experiment using an extra dataset can be conducted in the same domain. Table 10 shows the BLEU on the standard test set newstest2018, and the performance on the news sub-set of our lexical consistency test set.

# | LDC | News | newstest2018 BLEU | News test set BLEU | News test set Accuracy
1 | ×   | ×    | 13.70 (+0.53)     | 13.82 (+0.61)      | 62.4 (+4.9)
2 | ×   | ✓    | 13.93 (+0.76)     | 14.11 (+0.90)      | 65.9 (+8.4)
3 | ✓   | ✓    | 20.05 (+6.88)     | 19.85 (+6.64)      | 68.3 (+10.8)
  • ✓ indicates that the corpus is used for pre-training, while × means not. Numbers in brackets are the gains compared with SentNmt.

Table 10. Results of Our Approach With and Without Pre-Training


Compared with SentNmt, our approach trained from scratch (row 1) achieves a significant +0.53 BLEU gain on the standard test set, which is slightly lower than other models (shown in Table 6) that use the pre-training strategy [25, 26, 51]. On the targeted test set, the improvement in BLEU is +0.61. The accuracy of consistent instances is 62.4%, which exceeds other DocNMT methods (shown in Table 7). With pre-training on the internal 0.31M News data (row 2), the performance is further improved; the improvement in consistency accuracy is +8.4%. In particular, when we use the mixed data of the News data and 2.0M extra LDC data to pre-train our model (row 3), the results are improved by +6.88 and +6.64 BLEU on the two test sets, respectively. The overall results prove the effectiveness of the two-step training strategy.

The results show that the pre-training can improve both the translation quality and lexical consistency. On the targeted test set, the model gains 6.03 BLEU and 5.9% lexical consistency accuracy improvement (row 3 vs. row 1), respectively. We think the improvement of lexical consistency is mainly due to the improvement of translation quality of repeated words, especially repeated named entities, which benefits from the large-scale sentence-level pre-training.

6.7 Parameters and Speeds

Table 11 shows the parameters and speeds on the ZhEn TED task. Our model introduces 5.2M extra parameters (which mainly come from the Transformer layer of the document context in the global context extractor) over the SentNmt model. It translates all sentences in a document simultaneously. Because of the relatively simple network used to utilize context, the decoding speed of our approach is similar to that of the sentence-level system, and is 15.9% faster than SAN.

Model        | #Params | Training speed | Decoding speed
SentNmt [36] | 75.2 M  | 4,809          | 353.8
HAN [26]     | 87.9 M  | 2,805          | 261.3
SAN [25]     | 82.6 M  | 3,327          | 294.7
Ours         | 80.5 M  | 3,765          | 341.5

Table 11. Statistics of Parameters, Training, and Decoding Speeds (tokens/sec.)

6.8 Case Study

Table 12 shows two examples demonstrating that our approach can constrain the model to translate repeated source words consistently. In the first example, the repeated word “电力” in two sentences is translated into different forms (“power”, “electricity”, and “electric”) by DocNMT methods without a consistency constraint. In contrast, our approach generates the same translation “electric”. In the second example, only our approach translates the repeated named entity “诺塞尔” into the consistent word “Nousl”.


Table 12. Examples to Show that Our Approach Generates Consistent Translations

6.9 Results of English-to-Russian Discourse Phenomena

Voita et al. [38] propose a two-pass CADec model to handle the scenario where the sentence-level parallel corpus is large-scale but the document-level parallel corpus is rare. CADec utilizes both source-side context sentences and their target-side results translated by a SentNmt model. Voita et al. construct English-to-Russian contrastive test sets to evaluate four types of discourse phenomena: deixis, lexical cohesion, inflection ellipsis, and VP ellipsis. Each test instance is assigned three given context sentences. The evaluation method has been described in Section 4.

Their experiments show that CADec can handle these discourse phenomena well. However, the test set is not friendly to models that use only the source-side context. In their contrastive instances, the source-side context is the same, so the results of models using only the source-side context cannot change with the target-side context. Therefore, to compare with their method, we add our approach to the original CADec. We use the same experimental settings and datasets. The accuracy on the discourse phenomena is shown in Table 13.

Model        | deixis | lexical cohesion | inflection ellipsis | VP ellipsis
SentNmt [36] | 50.0   | 45.9             | 53.0                | 28.4
CADec [38]   | 81.6   | 58.1             | 72.2                | 80.0
CADec + Our  | 82.3   | 70.2             | 74.6                | 80.8

Table 13. Accuracy (%) of English-to-Russian Discourse Phenomena

Compared with the original CADec, our approach can further improve the performance on the discourse phenomena. We mainly focus on the results of lexical cohesion, i.e., the translation consistency of named entities in our article. It can be found that our approach improves the accuracy of lexical cohesion significantly. Our approach is oriented toward lexical consistency, explicitly modeling the consistency and distinguishing repeated from non-repeated words.


7 RELATED WORK

Document-level translation is an important branch of machine translation [8, 16, 32, 33, 47]. In this article, our goal is to enhance lexical translation consistency in DocNMT. Actually, this issue was widely studied in the age of SMT. Xiao et al. [44] define the ambiguous words that need to be translated consistently, and re-translate them using words in candidate sets in two ways. Ture et al. [35] introduce three features to encourage consistency. Garcia et al. [7] design a feature that scores lexical consistency using word embeddings, and a change operation affecting how the translation search space is explored. Some researchers also systematically analyze system behaviors on consistency [3, 9, 10, 28]. Other related works in SMT propose methods to improve lexical cohesion, which mainly take into account the repetition and hyponymy of words [45, 46]. With the rise of deep learning, NMT has surpassed SMT in many translation tasks. Some issues and their solutions applicable to the SMT framework should be reconsidered in the encoder-decoder framework. For DocNMT, our analysis has shown that lexical translation inconsistency is serious. However, there are few studies on the problem.

Existing DocNMT methods mainly focus on how to encode and use cross-sentence contextual information. Many works utilize a fixed number of previous source-side sentences as context [2, 12, 41]. Voita et al. [39] explore a context-aware model on the Transformer and show that cross-sentence attention can learn anaphora resolution. Zhang et al. [51] encode the context sentences with an extra encoder, and integrate them into each encoder and decoder layer. Miculicich et al. [26] propose a hierarchical attention network to model sentence-level attention. Yang et al. [49] use a capsule network to model relations within the context. Researchers also take advantage of the previous target-side context [26, 38]. Kuang et al. [19] design dynamic and topic caches to store the word embeddings of the translation history and global topics, respectively. Tu et al. [34] utilize the hidden states of the translation history. Voita et al. [38] add a second-pass decoder that leverages the source and the translations of SentNmt to improve the performance on discourse phenomena. They also propose a repair model that utilizes target-side monolingual data to learn a document-level language model [37]. On the other hand, some other works explore a larger context in the entire document [15, 24, 31]. Maruf et al. [25] propose a hierarchical selective attention network, where the attention weights are sharpened to attend to sentences and words. Xiong et al. [48] model the discourse coherence of the entire document with a two-pass decoder and a reward teacher. Considering that the improvement of standard MT metrics cannot fully reflect the resolution of discourse phenomena, some works focus on the evaluation of these phenomena [1, 38]. However, all existing DocNMT models are insensitive to repeated words in context. They fail to carefully analyze translation consistency and explicitly model it.


8 CONCLUSION AND FUTURE WORK

In this article, we analyze the discourse phenomena of Chinese-to-English translation in different genres. The analysis shows that lexical translation inconsistency is the most frequent error in DocNMT. We also summarize the types of translation consistency, and create a test set to evaluate lexical consistency automatically.

To alleviate lexical inconsistency, we propose an explicit approach to enhance lexical translation consistency. Specifically, we extract two types of global contextual information. Repeated source words in the document are used to generate consistency context vectors, which are integrated into the encoder side and the decoder side to modify the encoder representation and constrain the generation of consistent translations, respectively. Experiments on different datasets show that, compared with existing DocNMT models, our approach can substantially improve lexical translation consistency. Meanwhile, it can also significantly improve translation quality over SentNmt. In the future, we will explore translation consistency at the phrase level and larger granularities.


Acknowledgments

We thank anonymous reviewers for their insightful comments and suggestions.

Footnotes

  1. In order to obtain good sentence-level translations, we mixed the training data of News, TED Talks, and Subtitles together to alleviate the lack of data. Experiments show that the translation quality is far superior to models trained on individual genre-specific datasets. The details of training and data processing are described in Section 5.
  2. We treat each paragraph as a document in this article. Analyzed paragraphs come from the public test set newstest2019 in WMT2019 for news, and tst-2014-2015 in IWSLT2017 for TED talks. For subtitles, we extract the analysis set from the original training dataset provided by Wang et al. [40]. The selected sentences will not be used in training.
  3. Actually, the proposed approach itself does not distinguish the part-of-speech in repeated words, so we use “repeated words” when introducing the method in the following. But in fact, as suggested in the conclusion of Section 2.2, we enhance and evaluate the translation consistency of “repeated nouns” in our experiments.
  4. We do not consider the complex case where target words are partially repeated in this article.
  5. http://data.statmt.org/news-commentary/v14.
  6. https://wit3.fbk.eu/mt.php?release=2017-01-trnted.
  7. https://github.com/longyuewangdcu/tvsub.
  8. https://github.com/sameenmaruf/selective-attn/tree/master/data.
  9. https://github.com/moses-smt/mosesdecoder/tree/master/scripts.
  10. It is noted that we consider different target words with the same stem to be repetitive. And repeated source words with partially repeated translations are considered inconsistent.
  11. https://github.com/thumt/THUMT.
  12. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl.
  13. The significance test is conducted by the script “bootstrap-hypothesis-difference-significance.pl” in Moses.

REFERENCES

  1. [1] Bahdanau Dzmitry, Cho Kyunghyun, and Bengio Yoshua. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representation 2015.Google ScholarGoogle Scholar
  2. [2] Bawden Rachel, Sennrich Rico, Birch Alexandra, and Haddow Barry. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1. Association for Computational Linguistics, 13041313.Google ScholarGoogle ScholarCross RefCross Ref
  3. [3] Carpuat Marine and Simard Michel. 2012. The trouble with SMT consistency. In Proceedings of the 7th Workshop on Statistical Machine Translation. Association for Computational Linguistics, 442449. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. [4] Cho Kyunghyun, Merrienboer Bart van, Gulcehre Caglar, Bahdanau Dzmitry, Bougares Fethi, Schwenk Holger, and Bengio Yoshua. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 17241734.Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1. Association for Computational Linguistics, 41714186.Google ScholarGoogle Scholar
  6. [6] Dyer Chris, Chahuneau Victor, and Smith Noah A.. 2013. A simple, fast, and effective reparameterization of ibm model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 644648.Google ScholarGoogle Scholar
  7. [7] Garcia Eva Martínez, Creus Carles, España-Bonet Cristina, and Màrquez Lluís. 2017. Using word embeddings to enforce document-level lexical consistency in machine translation. The Prague Bulletin of Mathematical Linguistics 108, 1 (2017), 8596.Google ScholarGoogle ScholarCross RefCross Ref
  8. [8] Gong Zhengxian, Zhang Min, and Zhou Guodong. 2011. Cache-based document-level statistical machine translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 909919. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. [9] Guillou Liane. 2013. Analysing lexical consistency in translation. In Proceedings of the Workshop on Discourse in Machine Translation. Association for Computational Linguistics, 1018.Google ScholarGoogle Scholar
  10. [10] Itagaki Masaki, Aikawa Takako, and He Xiaodong. 2007. Automatic validation of terminology translation consistency with statistical method. InProceedings of the MT Summit XI. 269274.Google ScholarGoogle Scholar
  11. [11] Sebastien Jean and Kyunghyun Cho. 2019. Context-Aware learning for neural machine translation. CoRR, abs/1903.04715.Google ScholarGoogle Scholar
  12. [12] Sebastien Jean, Stanislas Lauly, Orhan Firat, and Kyunghyun Cho. 2017. Does neural machine translation benefit from larger context? CoRR, abs/1704.05135.Google ScholarGoogle Scholar
  13. [13] Jwalapuram Prathyusha, Joty Shafiq, Temnikova Irina, and Nakov Preslav. 2019. Evaluating pronominal anaphora in machine translation: An evaluation measure and a test suite. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 29642975.Google ScholarGoogle ScholarCross RefCross Ref
  14. [14] Kang Xiaomian, Zhao Yang, Zhang Jiajun, and Zong Chengqing. 2020. Dynamic context selection for document-level neural machine translation via reinforcement learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 22422254.Google ScholarGoogle ScholarCross RefCross Ref
  15. [15] Kang Xiaomian and Zong Chengqing. 2020. Fusion of discourse structural position encoding for neural machine translation. Chinese Journal of Intelligent Science and Technologie 2, 2 (2020), 144152.Google ScholarGoogle Scholar
  16. [16] Kang Xiaomian, Zong Chengqing, and Xue Nianwen. 2019. A survey of discourse representations for chinese discourse annotation. ACM Transactions on Asian and Low-Resource Language Information Processing 18, 3 (2019), 125. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. [17] Kim Yunsu, Tran Duc Thanh, and Ney Hermann. 2019. When and why is document-level context useful in neural machine translation? In Proceedings of the 4th Workshop on Discourse in Machine Translation. Association for Computational Linguistics, 2434.Google ScholarGoogle ScholarCross RefCross Ref
  18. [18] Koehn Philipp, Hoang Hieu, Birch Alexandra, Callison-Burch Chris, Federico Marcello, Bertoldi Nicola, Cowan Brooke, Shen Wade, Moran Christine, Zens Richard, Dyer Chris, Bojar Ondřej, Constantin Alexandra, and Herbst Evan. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. Association for Computational Linguistics, 177180. Google ScholarGoogle ScholarDigital LibraryDigital Library
[19] Kuang Shaohui, Xiong Deyi, Luo Weihua, and Zhou Guodong. 2018. Modeling coherence for neural machine translation with dynamic and topic caches. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, 596–606.
[20] Luong Thang, Pham Hieu, and Manning Christopher D. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1412–1421.
[21] Ma Shuming, Zhang Dongdong, and Zhou Ming. 2020. A simple and effective unified encoder for document-level machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 3505–3511.
[22] Ma Yanjun, He Yifan, Way Andy, and van Genabith Josef. 2011. Consistent translation using discriminative learning: A translation memory-inspired approach. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 1239–1248.
[23] Manning Christopher D., Surdeanu Mihai, Bauer John, Finkel Jenny, Bethard Steven J., and McClosky David. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, 55–60.
[24] Maruf Sameen and Haffari Gholamreza. 2018. Document context neural machine translation with memory networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1. Association for Computational Linguistics, 1275–1284.
[25] Maruf Sameen, Martins André F. T., and Haffari Gholamreza. 2019. Selective attention for context-aware neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1. Association for Computational Linguistics, 3092–3102.
[26] Miculicich Lesly, Ram Dhananjay, Pappas Nikolaos, and Henderson James. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2947–2954.
[27] Papineni Kishore, Roukos Salim, Ward Todd, and Zhu Wei-Jing. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 311–318.
[28] Pu Xiao, Mascarell Laura, and Popescu-Belis Andrei. 2017. Consistent translation of repeated nouns using syntactic and semantic cues. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 1. Association for Computational Linguistics, 948–957.
[29] Radford Alec, Narasimhan Karthik, Salimans Tim, and Sutskever Ilya. 2018. Improving Language Understanding by Generative Pre-Training. Technical report. OpenAI.
[30] Sennrich Rico, Haddow Barry, and Birch Alexandra. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Vol. 1. Association for Computational Linguistics, 1715–1725.
[31] Tan Xin, Zhang Longyin, Xiong Deyi, and Zhou Guodong. 2019. Hierarchical modeling of global context for document-level neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 1576–1585.
[32] Tu Mei, Zhou Yu, and Zong Chengqing. 2013. A novel translation framework based on rhetorical structure theory. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Vol. 2. Association for Computational Linguistics, 370–374.
[33] Tu Mei, Zhou Yu, and Zong Chengqing. 2014. Enhancing grammatical cohesion: Generating transitional expressions for SMT. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Vol. 1. Association for Computational Linguistics, 850–860.
[34] Tu Zhaopeng, Liu Yang, Shi Shuming, and Zhang Tong. 2018. Learning to remember translation history with a continuous cache. Transactions of the Association for Computational Linguistics 6 (2018), 407–420. DOI: https://doi.org/10.1162/tacl_a_00029
[35] Ture Ferhan, Oard Douglas W., and Resnik Philip. 2012. Encouraging consistent translation choices. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 417–426.
[36] Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, and Polosukhin Illia. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 5998–6008.
[37] Voita Elena, Sennrich Rico, and Titov Ivan. 2019. Context-aware monolingual repair for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 877–886.
[38] Voita Elena, Sennrich Rico, and Titov Ivan. 2019. When a good translation is wrong in context: Context-aware machine translation improves on deixis, ellipsis, and lexical cohesion. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1198–1212.
[39] Voita Elena, Serdyukov Pavel, Sennrich Rico, and Titov Ivan. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Vol. 1. Association for Computational Linguistics, 1264–1274.
[40] Wang Longyue, Tu Zhaopeng, Shi Shuming, Zhang Tong, Graham Yvette, and Liu Qun. 2018. Translating pro-drop languages with reconstruction models. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. AAAI Press, 1–9.
[41] Wang Longyue, Tu Zhaopeng, Way Andy, and Liu Qun. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2826–2831.
[42] Wong Billy T. M. and Kit Chunyu. 2012. Extending machine translation evaluation metrics with lexical cohesion to document level. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 1060–1068.
[43] Wong KayYen, Maruf Sameen, and Haffari Gholamreza. 2020. Contextual neural machine translation improves translation of cataphoric pronouns. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 5971–5978.
[44] Xiao Tong, Zhu Jingbo, Yao Shujie, and Zhang Hao. 2011. Document-level consistency verification in machine translation. In Proceedings of the Machine Translation Summit, Vol. 13. 131–138.
[45] Xiong Deyi, Ben Guosheng, Zhang Min, Lv Yajuan, and Liu Qun. 2013. Modeling lexical cohesion for document-level machine translation. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence. AAAI Press, 2183–2189.
[46] Xiong Deyi, Ding Yang, Zhang Min, and Tan Chew Lim. 2013. Lexical chain based cohesion models for document-level statistical machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1563–1573.
[47] Xiong Deyi, Zhang Min, and Wang Xing. 2015. Topic-based coherence modeling for statistical machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23, 3 (2015), 483–493.
[48] Xiong Hao, He Zhongjun, Wu Hua, and Wang Haifeng. 2019. Modeling coherence for discourse neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 7338–7345.
[49] Yang Zhengxin, Zhang Jinchao, Meng Fandong, Gu Shuhao, Feng Yang, and Zhou Jie. 2019. Enhancing context modeling with a query-guided capsule network for document-level translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 1527–1537.
[50] Zhang Jiacheng, Ding Yanzhuo, Shen Shiqi, Cheng Yong, Sun Maosong, Luan Huanbo, and Liu Yang. 2017. THUMT: An open source toolkit for neural machine translation. CoRR abs/1706.06415.
[51] Zhang Jiacheng, Luan Huanbo, Sun Maosong, Zhai FeiFei, Xu Jingfang, Zhang Min, and Liu Yang. 2018. Improving the transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 533–542.
[52] Zhao Yang, Zhang Jiajun, Zong Chengqing, He Zhongjun, and Wu Hua. 2019. Addressing the under-translation problem from the entropy perspective. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence. AAAI Press, 451–458.
[53] Zheng Zaixiang, Yue Xiang, Huang Shujian, Chen Jiajun, and Birch Alexandra. 2020. Towards making the most of context in neural machine translation. In Proceedings of the 29th International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, 3983–3989.
[54] Zhou Long, Zhang Jiajun, and Zong Chengqing. 2019. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics 7, 5 (2019), 91–105. DOI: https://doi.org/10.1162/tacl_a_00256

• Published in: ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 3 (May 2022), 413 pages. ISSN: 2375-4699; EISSN: 2375-4702. DOI: 10.1145/3505182.


• Publisher: Association for Computing Machinery, New York, NY, United States.
• Publication History: Received 1 August 2020; Revised 1 April 2021; Accepted 1 September 2021; Published 13 December 2021, in TALLIP Volume 21, Issue 3.
