
1 Introduction

Neural Machine Translation (NMT) [2, 16, 17] has made significant progress. However, even the best-trained translation models still make unpredictable errors in practical applications [3]. Figure 1 illustrates the fragility of NMT. Robustness is the property that a model maintains its performance despite perturbations or noise; for machine translation, it refers to the model's ability to adapt to new corpora. Insufficient training and a weak ability to learn from noise mean that adding even small perturbations to a sentence can make the model produce a completely different translation, which seriously degrades performance. Earlier work improved robustness by manually compiling error features [7, 18], but this is costly and some features are inapplicable to machine translation tasks.

Fig. 1. Fragility of neural machine translation. Leaving out a Chinese character “ ” and word “ ” leads to significant errors in the Mongolian translation.

Adversarial examples are an important tool for exploring the robustness of deep learning systems and were initially applied in computer vision [14]. Recently, researchers have applied adversarial examples to Natural Language Processing (NLP) tasks [4, 5, 19] at the character, word, phrase, and sentence level. An adversarial example makes the model produce erroneous output by adding carefully designed perturbations to the input. In general, the existence of adversarial examples implies that the model relies on non-robust features and is therefore itself less robust. Adversarial training is performed through data augmentation, where adversarial examples are mixed into the training set in some proportion. A model retrained on this new dataset learns to cope with these non-robust features and therefore becomes more robust. Thus, for machine translation tasks, we can quickly obtain a large amount of parallel data suitable for robustness analysis by generating adversarial examples from the source side of an existing parallel corpus and pairing them with the original target translations.

However, unlike images, where adversarial examples can be obtained directly by gradient optimization, the sentence space in NLP is discrete, so it is difficult to perturb text along the gradient update direction when generating adversarial examples. On the other hand, if common noise injection such as adding, deleting, or modifying words is used to perturb the source input, the generated adversarial examples not only struggle to preserve sentence fluency and semantic consistency but may even degrade the model's performance. This is especially true for low-resource translation tasks, where the lack of a massive parallel corpus already leads to poor model performance, poor robustness, and weak adaptation to new corpora or noisy sentences. Therefore, to improve the robustness of low-resource translation, this paper uses reinforcement learning to generate adversarial examples and employs a discriminator in the environment as the terminal signal to further constrain semantics. Furthermore, we add a language model to evaluate the fluency of the adversarial examples. The method learns how to apply discrete token-level perturbations that directly reduce translation quality. Experimental results on the CCMT2019 Mongolian-Chinese and CWMT2017 Uighur-Chinese tasks show that fine-tuning the model with the adversarial examples generated by this method significantly improves its performance.

2 Background and Related Work

2.1 Neural Machine Translation

Neural machine translation (NMT) mainly uses an encoder-decoder structure to encode the semantics of the source language and predict the target language. Specifically, an Encoder encodes the input source sentence \(x=(x_1,...,x_n)\) into a fixed vector, and a Decoder then decodes this vector to produce the target sentence. For the target word \(y_{t}\), given its preceding word sequence \(y_{<t}\) and the source sentence x, the probability of the current target word is \(P(y_{t} |y_{<t},x)\). The specific calculation is shown in Eq. (1):

$$\begin{aligned} \begin{aligned} P(y_{t} |y_{<t},x)\propto exp(y_{t};r_{t};C_{t}) \end{aligned} \end{aligned}$$
(1)

where \(r_{t}\) is the hidden state of the NMT Encoder at time t and \(C_{t}\) is the context information of the generated word \(y_{t}\), defined from the Encoder's hidden states. NMT is trained with Maximum Likelihood Estimation (MLE). Given N training sentence pairs \(\left\{ x^{i} ,y^{i} \right\} _{i=1}^{N}\), at each time step NMT generates the target word \(y_{t}\) by maximizing the translation probability given the source sentence x. The training objective is to maximize Eq. (2):

$$\begin{aligned} \begin{aligned} L_{MLE}= \sum _{i=1}^{N}logp(y^{i} |x^{i}) =\sum _{i=1}^{N}\sum _{t=1}^{M}logp(y_{t}^{i}|y_{1}^{i} ...y_{t-1}^{i},x^{i}) \end{aligned} \end{aligned}$$
(2)
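For concreteness, the following minimal sketch (ours, not the authors' code) computes the MLE objective of Eq. (2) as a sum of per-token log-probabilities over a batch; the tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mle_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Negative of the objective in Eq. (2) for one batch.

    logits: (batch, tgt_len, vocab) decoder scores at each target position.
    target: (batch, tgt_len) gold target token ids.
    """
    log_probs = F.log_softmax(logits, dim=-1)                          # log p(y_t | y_<t, x)
    token_ll = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # log-prob of each gold token
    return -token_ll.sum(dim=1).mean()                                 # maximizing Eq. (2) = minimizing this
```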

2.2 Adversarial Example, Adversarial Attack and Adversarial Training in NLP

An Adversarial Example, denoted \(\hat{x}\), is obtained by adding a perturbation bounded by \(\delta \) to the original input sample \(\left( x,y \right) \) such that the model's output deteriorates. For an original sample \(\left( x,y \right) \), its adversarial example set \(\mathcal {A}\left( x,y \right) \) is defined in Eq. (3):

$$\begin{aligned} \begin{aligned} \mathcal {A}\left( x,y \right) =\left\{ \hat{x} |R\left( \hat{x},x \right) \le \delta \wedge M\left( \hat{x} \right) \ne y \right\} \end{aligned} \end{aligned}$$
(3)

where \(R\left( \hat{x},x \right) \) measures the perturbation between the perturbed sample \(\hat{x}\) and the original sample x. The "restricted perturbation" requires \(R\left( \hat{x},x \right) \) to be bounded by \(\delta \). The model M is generally non-robust, so when it receives a sample \(\hat{x}\) with only minor perturbations, the resulting \(M\left( \hat{x} \right) \) can be completely different from the original output \(M\left( x \right) \). The generation of adversarial examples is usually associated with perturbations of non-robust features.
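Procedurally, Eq. (3) is a membership test; the sketch below (our illustration, with the distance function, budget, and victim model passed in as callables) makes the two conditions explicit.

```python
from typing import Callable, Sequence

def is_adversarial(x_hat: Sequence[str], x: Sequence[str], y: str,
                   model: Callable[[Sequence[str]], str],
                   distance: Callable[[Sequence[str], Sequence[str]], float],
                   delta: float) -> bool:
    """Membership test for A(x, y) in Eq. (3): the perturbation is bounded by delta
    and the model output on the perturbed input no longer matches y."""
    return distance(x_hat, x) <= delta and model(x_hat) != y
```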

An Adversarial Attack is the process of generating adversarial examples \( \hat{x}_{1}, \hat{x}_{2},..., \hat{x}_{n}\in \mathcal {A} \left( x,y \right) \) against model M and sample \(\left( x,y \right) \). It aims to find non-robust but useful features along which to perturb x so that the model ultimately produces erroneous output. According to the attack setting, adversarial attacks are divided into black-box and white-box attacks. In a black-box attack, the attack algorithm can only access the output of M without knowing its parameters or structure. In a white-box attack, the attack algorithm can access all information and parameters of M and generate adversarial examples based on its gradient. In NLP tasks, the discreteness of the sentence space makes it difficult to perturb effectively with gradient information, so white-box attacks are hard to apply.

Adversarial Training is the process in which adversarial examples generated on the training set through adversarial attacks are used for data augmentation, and the augmented data is used to retrain the model M. It can therefore be framed as an optimization problem, with the expectation that both the performance and the robustness of the model will improve. The originally non-robust features may become uninformative after adversarial training, weakening the association between non-robust features and labels and thereby making the model resistant to perturbations.

2.3 Genetic Algorithm-Based Adversarial Attack

GeneticAttack [1] is a black-box adversarial attack method that performs word-level perturbations on examples and uses a genetic algorithm to optimize them. Inspired by the theory of biological evolution, the core of the genetic algorithm lies in population mutation, crossover, and selection. The population in GeneticAttack consists of several sentences, and its size is limited by a hyperparameter. The mutation operation is carried out by synonym replacement, where synonyms are obtained from an independently trained word-embedding matrix. During mutation, GeneticAttack additionally uses a language model to filter out inappropriate word substitutions. The crossover operation takes two sentences from the population and, at each word position, randomly selects the word from one of them to form a new sentence. The new sentences form the next generation of the population. The selection fitness is the model output \(M\left( \hat{x} \right) \).
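The search described above can be compressed into a few lines. The sketch below is a simplified rendition of GeneticAttack's loop, not the reference implementation of [1]; the fitness, synonym, and language-model callbacks are assumptions supplied by the caller.

```python
import random
from typing import Callable, List

Sentence = List[str]

def genetic_attack(x: Sentence,
                   fitness: Callable[[Sentence], float],      # e.g. drop in P(y | x_hat) under the victim model
                   synonyms: Callable[[str], List[str]],      # nearest neighbours in an embedding space
                   lm_ok: Callable[[Sentence, int], bool],    # language-model filter for a substitution
                   pop_size: int = 20, generations: int = 30) -> Sentence:
    def mutate(sent: Sentence) -> Sentence:
        i = random.randrange(len(sent))
        cands = [w for w in synonyms(sent[i]) if lm_ok(sent[:i] + [w] + sent[i + 1:], i)]
        return sent[:i] + [random.choice(cands)] + sent[i + 1:] if cands else sent

    def crossover(a: Sentence, b: Sentence) -> Sentence:
        return [random.choice(pair) for pair in zip(a, b)]    # pick each word from one of the parents

    population = [mutate(list(x)) for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                      for _ in range(pop_size)]
    return max(population, key=fitness)
```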

2.4 Gradient-Based Adversarial Attack

HotFlip [6] is a typical white-box attack method. It uses gradient ascent to directly select the largest perturbation among the acceptable ones while limiting the degree of perturbation, thereby generating adversarial examples quickly and efficiently. HotFlip represents sentences with one-hot encodings as a three-dimensional tensor, in which each word corresponds to a matrix and each column of the matrix is a one-hot character vector. This has the advantage that character substitutions can be represented by a tensor of the same size as the sentence, and their effect can be estimated from the gradient tensor. During character replacement, HotFlip directly selects the character closest to the gradient direction. Insertion and deletion operations are realized through character substitutions within words.
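Under a first-order approximation, the flip that most increases the loss can be read off the gradient with respect to the one-hot input. The sketch below is our simplified rendering of that scoring step, not the original HotFlip code; beam search over multiple flips and the insertion/deletion operations are omitted.

```python
import torch

def best_flip(one_hot: torch.Tensor, grad: torch.Tensor) -> tuple:
    """Pick the single character flip that most increases the loss (first-order estimate).

    one_hot: (seq_len, vocab) one-hot encoding of the current characters.
    grad:    (seq_len, vocab) gradient of the loss w.r.t. the one-hot input.
    Flipping position i from character a to b changes the loss by roughly grad[i, b] - grad[i, a].
    """
    current = (one_hot * grad).sum(dim=-1, keepdim=True)   # gradient at the characters currently in place
    gain = grad - current                                   # estimated loss increase for every possible flip
    gain[one_hot.bool()] = float("-inf")                    # forbid "flipping" to the same character
    flat = torch.argmax(gain)
    pos, new_char = divmod(flat.item(), gain.size(1))
    return pos, new_char
```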

3 Adversarial Examples Based on Reinforcement Learning

Adversarial examples must remain semantically consistent with the original input while causing the model to produce incorrect output. GeneticAttack (Sect. 2.3) generates adversarial examples through synonym substitution, so the resulting perturbations are usually tiny in semantic space, but the algorithm cannot exploit gradient information to generate perturbations efficiently. HotFlip (Sect. 2.4) uses gradient information, making its attacks extremely efficient; however, it can only attack character-level models and produces meaningless words that greatly reduce the overall fluency of the sentence.

Therefore, this paper uses Reinforcement Learning (RL) to generate adversarial examples, combining the advantages of the two methods above. Following [20], the problem is regarded as a restricted Markov Decision Process (MDP) that edits the tokens at each position in the source sentence from left to right. Each editing decision depends on the impact of the existing modifications on the semantics and on the expected degradation of the system output. Furthermore, inspired by GeneticAttack [1], we add a Language Model (LM) to measure the fluency of the adversarial examples. The generation strategy is learned through continuous interactive feedback on the degree to which the translation model is attacked and on the fluency of the examples.

3.1 Reinforcement Learning

Fig. 2. Reinforcement learning.

As an important branch of machine learning, reinforcement learning studies how an agent can be trained by interacting with an environment and receiving feedback, so as to "automatically" decide on an optimal solution [8]. Figure 2 illustrates the process. At each time t, the Agent receives state \(s_{t}\) from the Environment and takes action \(a_{t}\) based on \(s_{t}\); the action acts on the Environment, which produces reward \(r_{t}\), and the Agent then reaches the new state \(s_{t+1}\). Figure 3(a) shows the overall framework of the model, in which the Environment (Sect. 3.2) and the Agent (Sect. 3.3) are the two main parts. The Agent learns to modify the token at each position of the original example sequentially from left to right. A discriminator in the Environment determines whether the modified example remains semantically consistent with the original one; at the same time, the modified example is fed into the language model and the translation model to evaluate its fluency and whether it degrades the translation model. The specific process is shown in Fig. 3(b).
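Written as code, the loop in Fig. 2 is simply the following; `env` and `agent` stand for the components of Sects. 3.2 and 3.3, and the method names (`reset`, `step`, `act`, `observe`) are illustrative interface assumptions rather than the paper's implementation.

```python
def run_episode(env, agent):
    """Generic agent-environment interaction loop of Fig. 2 (interface names are illustrative)."""
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                          # a_t chosen from s_t
        next_state, reward, done = env.step(action)        # environment returns r_t and s_{t+1}
        agent.observe(state, action, reward, next_state)   # stored for later policy/critic updates
        state, total_reward = next_state, total_reward + reward
    return total_reward
```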

3.2 Environment

This section details the environment state and the calculation of rewards.

Fig. 3. The architecture and specific process of our method.

State. The state of the Environment is described as \(s_{t} =(x,t)\), where \(x =(x^{1},...,x^{N})\) is a batch of N sequences. Begin and end tags (BOS and EOS) are added to each sequence \(x^{i} =(x_{1},x_{2},...,x_{n})\), and all sequences are padded to the same length. \(t\in (1, n)\) indicates the token position to be perturbed by the Agent. The Environment loops over all token positions consecutively and updates \(s_{t}\) based on the Agent's modification. The Environment also yields reward signals until the episode ends or is terminated early.

Reward Calculation. The reward \(r_{t}\) consists of a survival reward \(r_{s}\) at each step, plus a final degradation reward \(r_{d}\) and a fluency reward \(r_{l}\) when the Agent survives to the end. The reward at each time step is calculated as follows:

$$\begin{aligned} \begin{aligned} r_{t} ={\left\{ \begin{array}{ll} -1, &{} \text {terminated}\\ \frac{1}{N}\sum _{N}\alpha \cdot r_{s}, &{} \text {survive}\wedge t\in [1,n) \\ \frac{1}{N}\sum _{N}(\alpha \cdot r_{s}+\beta \cdot r_{d}+\gamma \cdot r_{l}), &{} \text {survive}\wedge t=n \end{array}\right. } \end{aligned} \end{aligned}$$
(4)

where \(\alpha \), \(\beta \), and \(\gamma \) are hyper-parameters. Since the adversarial examples must maintain semantic consistency with the original examples, the Agent must also fool the discriminator D in order to survive; D produces the terminal or survival signal by judging whether the modified sequence still matches the original target translation y. If D judges the pair as positive, the corresponding probability is used as the reward, otherwise the reward is 0:

$$\begin{aligned} \begin{aligned} r_{s} ={\left\{ \begin{array}{ll} P(positive|(\hat{x},y);\theta _{d} ), &{} \text {positive} \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned} \end{aligned}$$
(5)

When all sequences in x have been terminated early, the overall reward \(r_{t}\) is \(-1\). For an example judged "negative" during the survival phase, its subsequent rewards and actions are masked to zero. If the Agent survives to the end, the Environment produces an additional average degradation reward \(r_{d}\) and fluency reward \(r_{l}\) as the final reward of the current training episode. For \(r_{d}\), we adopt relative degradation [11]:

$$\begin{aligned} \begin{aligned} r_{d} =\frac{score(y,refs)-score(y^{'},refs )}{score(y,refs)} \end{aligned} \end{aligned}$$
(6)

where y and \(y^{'}\) denote the original and perturbed outputs, refs are the references, and score is a translation metric. If score(y, refs) is 0, we return 0 as \(r_{d}\).
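A minimal sketch of the per-step reward of Eqs. (4)-(6) is given below; the list-based batching and the handling of the hyper-parameters are our own simplifications.

```python
def step_reward(survived: bool, last_step: bool, r_s: list, r_d: list = None, r_l: list = None,
                alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Per-step reward following Eq. (4); r_s, r_d, r_l hold per-sequence values for the N sequences."""
    if not survived:
        return -1.0
    n = len(r_s)
    if not last_step:
        return sum(alpha * s for s in r_s) / n
    return sum(alpha * s + beta * d + gamma * l for s, d, l in zip(r_s, r_d, r_l)) / n

def relative_degradation(score_orig: float, score_pert: float) -> float:
    """Eq. (6): relative drop in the translation metric; returns 0 when the original score is 0."""
    return 0.0 if score_orig == 0 else (score_orig - score_pert) / score_orig
```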

To obtain smoother adversarial examples, we use a language model in the reward calculation of the Environment so that the Agent considers the fluency of the examples when modifying the original ones. Plain n-gram language models suffer from zero probabilities and statistical sparsity. Katz smoothing alleviates this "unsmoothness": if the n-gram exists in the language model, its discounted probability is used directly; if the higher-order n-gram does not exist, the reserved probability mass is allocated according to the (n-1)-order model, and so on. We adopt a 3-gram model, and \(r_{l}\) is the fluency score of sequence \(\hat{x}\):

$$\begin{aligned} \begin{aligned} r_{l}=\sum _{t=1}^{n} P_{katz} \left( x_{t} |x_{t-2}^{t-1} \right) \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned} P_{katz} \left( x_{t} |x_{t-2}^{t-1} \right) ={\left\{ \begin{array}{ll} P_{ML} \left( x_{t} |x_{t-2}^{t-1} \right) , &{} \text {if}\ count\left( x_{t-2}^{t} \right) > 0 \\ \lambda P_{katz}^{\left( n-1 \right) } \left( x_{t} |x_{t-1} \right) , &{} \text {if}\ count\left( x_{t-2}^{t} \right) = 0 \end{array}\right. } \end{aligned} \end{aligned}$$
(8)
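The fluency reward can be sketched as a back-off trigram scorer in the spirit of Eqs. (7)-(8). Note that this is a simplification: a full Katz implementation uses Good-Turing discounting and properly normalized back-off weights, whereas here a single fixed back-off factor is assumed.

```python
from collections import Counter

class BackoffTrigramLM:
    """Minimal back-off trigram scorer illustrating Eqs. (7)-(8); not a full Katz implementation."""

    def __init__(self, corpus, lam: float = 0.4):
        self.lam = lam
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        for sent in corpus:                                   # corpus: iterable of token lists
            toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
            self.uni.update(toks)
            self.bi.update(zip(toks, toks[1:]))
            self.tri.update(zip(toks, toks[1:], toks[2:]))
        self.total = sum(self.uni.values())

    def p(self, w3, w1, w2):
        if self.tri[(w1, w2, w3)] > 0:                        # trigram seen: use its relative frequency
            return self.tri[(w1, w2, w3)] / self.bi[(w1, w2)]
        if self.bi[(w2, w3)] > 0:                             # back off to the bigram
            return self.lam * self.bi[(w2, w3)] / self.uni[w2]
        return self.lam * self.lam * self.uni[w3] / self.total  # final back-off to the unigram

    def fluency(self, sent):
        """Eq. (7): sum of smoothed trigram probabilities over the sequence."""
        toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
        return sum(self.p(toks[i], toks[i - 2], toks[i - 1]) for i in range(2, len(toks)))
```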

3.3 Agent

As shown in Fig. 3(a), the Agent uses the Actor-Critic algorithm [9] to modify examples; the actor and the critic share the same input layers and encoder. The Actor takes in the source sentence and the current token with its surrounding context \((x_{t-1}, x_{t}, x_{t+1})\) and outputs a binary distribution that determines whether to attack the token at step t, while the Critic emits a value \(V(s_{t})\) for every state. Once the Actor decides that the token at a position should be perturbed, it is replaced with one of the token's candidates within distance \(\sigma \) in the vocabulary. See [20] for details on training and inference.
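A schematic of the shared actor-critic network is sketched below; the mean-pooled GRU encoder, the dimensions, and the interface are placeholder assumptions rather than the configuration used in [9, 20].

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared encoder with an actor head (attack / keep) and a critic head (state value)."""

    def __init__(self, vocab_size: int, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.actor = nn.Linear(2 * d_model, 2)    # binary decision for the current token
        self.critic = nn.Linear(2 * d_model, 1)   # V(s_t)

    def forward(self, source: torch.Tensor, window: torch.Tensor):
        """source: (batch, src_len) token ids; window: (batch, 3) ids of (x_{t-1}, x_t, x_{t+1})."""
        src_enc, _ = self.encoder(self.embed(source))
        src_vec = src_enc.mean(dim=1)              # pooled source representation
        win_vec = self.embed(window).mean(dim=1)   # local context around position t
        h = torch.cat([src_vec, win_vec], dim=-1)
        return torch.softmax(self.actor(h), dim=-1), self.critic(h).squeeze(-1)
```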

4 Experiment

4.1 Data Preprocessing

We test our adversarial example generation on the Mongolian-Chinese (Mo-Zh) translation task of CCMT2019 and the Uighur-Chinese (Ug-Zh) translation task of CWMT2017. We use the open-source Chinese word segmentation tool THULAC [10] to segment the Chinese corpus, so that the corpus is better suited to the model and the performance degradation caused by word order is reduced to a certain extent. Byte Pair Encoding (BPE) [13] is used to process the Mongolian and Uighur corpora: the text is first split into the smallest granularity, and the highest-frequency character substrings are then merged into a newly generated dictionary that supports training the translation model. This approach splits sentences into units between words and characters, preserving contextual semantics to a certain extent while alleviating data sparsity.
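The BPE merge procedure itself can be sketched in a few lines following [13]; the experiments would use an existing toolkit (e.g. subword-nmt) rather than this illustrative code.

```python
import re
from collections import Counter

def learn_bpe(word_freqs: Counter, num_merges: int):
    """Learn BPE merges following [13]: repeatedly merge the most frequent adjacent symbol pair.

    Keys of `word_freqs` are space-separated character sequences such as "l o w </w>".
    """
    merges, vocab = [], dict(word_freqs)
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                      # most frequent symbol pair
        merges.append(best)
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges
```

For example, `learn_bpe(Counter({"l o w </w>": 5, "l o w e r </w>": 2}), 10)` returns merge operations such as `('l', 'o')`, which are then applied to split unseen words into learned subword units.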

4.2 NMT Model

This paper selects the state-of-the-art RNNSearch [2] and Transformer [15] as victim translation models. RNNSearch is an RNN-based encoder-decoder framework; we set the hidden layer size and word-embedding dimension to 512 and use \(dropout=0.1\). The final model is obtained by averaging the last 20 checkpoints, and the learning rate is adjusted adaptively. For the Transformer, we set \(dropout=0.2\) and the word-embedding dimension to 1024, with the learning rate and checkpoint settings consistent with RNNSearch.

4.3 Evaluating Indicator

We report de-tokenized BLEU computed with SacreBLEU [12] as the evaluation metric for adversarial examples, and also assess source-side semantic similarity with a human evaluation (HE) score ranging from 0 to 5 (Table 1), following [11], by randomly sampling 20\(\%\) of the sequences for a double-blind test.

Table 1. Human evaluation metrics.
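Corpus-level BLEU with SacreBLEU can be computed as follows; the toy hypothesis and reference strings are purely illustrative.

```python
import sacrebleu

# De-tokenized system outputs and references, aligned line by line.
hypotheses = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")          # corpus-level score, as reported in Tables 2 and 3
```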

4.4 Adversarial Attack Results and Analysis

We utilize GeneticAttack [1], HotFlip [6] and our method to generate adversarial examples for the test set respectively to attack the translation model. Table 2 illustrates the deterioration degree of adversarial examples to different translation tasks. We randomly select 20% of the adversarial examples for double-blind human evaluation to evaluate the semantic similarity between the adversarial example and the original example.

Table 2. Experiment results for Mo-Zh and Ug-Zh MT attacks. We list BLEU for the perturbed test sets generated by each adversarial example generation method. An ideal adversarial example should achieve low BLEU and high HE.

As shown in Table 2, GeneticAttack modifies the original example using synonym replacement and genetic algorithm optimization, so the perturbations are small; however, compared with our method it lacks semantic and fluency constraints, which easily leads to grammatical problems, and it is inefficient because it cannot exploit gradient information effectively. HotFlip mainly uses gradient information to generate typo-style adversarial examples (such as " \(\rightarrow \) ") that improve the model's ability to adapt to and correct typos. Although it is efficient, it may produce meaningless words that greatly reduce the overall fluency of the sentence. Our method uses the actor-critic to modify the token at each position of the original example from left to right, uses the discriminator to constrain the semantics of the adversarial example, and uses the language model and the translation model to evaluate its fluency and the overall degradation of the victim model. Therefore, with the same training setup across different models and tasks, our method stably generates adversarial examples without significant semantic change and without handcrafted semantic constraints, achieving stable model degradation and high HE.

4.5 Adversarial Training Results and Analysis

Since the Agent can effectively generate adversarial examples that retain the semantic information, we can directly use these examples to fine-tune the original translation model. Given the original training data, the Transformer models of the different methods are used to generate an equal number of adversarial examples, which are then paired with the original target sentences.
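A minimal sketch of this pairing and mixing step, under the assumption that adversarial sources are aligned one-to-one with the original pairs (the mixing ratio is an illustrative knob, not a value from the paper):

```python
import random

def build_finetune_set(original_pairs, adversarial_sources, mix_ratio: float = 1.0):
    """Pair each adversarial source with its original target and mix into the training data."""
    adv_pairs = [(src_adv, tgt) for src_adv, (_, tgt) in zip(adversarial_sources, original_pairs)]
    k = int(len(adv_pairs) * mix_ratio)
    mixed = list(original_pairs) + random.sample(adv_pairs, k)
    random.shuffle(mixed)
    return mixed
```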

Table 3. Fine-tuning with adversarial examples.

We train the model directly on a mixture of the augmented sentence pairs and the original sentence pairs. As shown in Table 3, adversarial training with the examples generated by GeneticAttack and HotFlip can improve the performance of the model, but the effect is small, because they guarantee neither semantic consistency nor sentence fluency and attack the model only weakly. Our method both preserves the semantics and attacks the translation model strongly. The fine-tuning results with our adversarial examples show that the robustness of the model can be improved significantly.

4.6 Ablation Study

Table 4. Results of the ablation study; "\(\circ \)" means the module is used and "\(\times \)" means it is not.

Table 4 shows the results of the ablation study. Line 1 uses only the discriminator (D) reward to guide the Agent's optimization. The NMT reward \(r_{d}\) clearly plays a critical role, since removing it impairs model performance (lines 2 and 3). The language model reward is also beneficial (line 4), but its contribution is relatively smaller than that of \(r_{d}\).

5 Conclusion

This paper adopts a novel approach to generating adversarial examples for low-resource machine translation tasks. It can expose defects of the translation model without manually crafted error features, while ensuring semantic consistency with the original examples. Experimental results on CCMT2019 Mongolian-Chinese and CWMT2017 Uighur-Chinese show that the method achieves stable model degradation on different attacked models. Furthermore, we use the adversarial examples to fine-tune the model, and its performance improves significantly after adversarial training.