1 Introduction

Entities such as the United Nations and other international organizations need to translate all the documentation they generate into different languages, producing very large multilingual corpora which are ideal for training Statistical Machine Translation (SMT) [1] systems. However, such large corpora are difficult to process, increasing the computational requirements needed to train SMT systems robustly. For example, the corpora used in recent SMT evaluations are on the order of 1 billion running words [2].

The main problems that arise when using such a huge pool of sentences are:

  • The use of all corpora for training increases the computational requirements.

  • Topic mismatch between training and test conditions impacts translation quality.

Nevertheless, standard practice is to train SMT systems with all the available data, following the premise of “the more data, the better”. This assumption is usually correct if all the data belongs to the same domain, but most commercial SMT systems will not have such luck: most SMT systems are designed to translate very specific text, such as user manuals or medical leaflets. Hence, the challenge is to wisely select the subset of bilingual sentences that performs best for the topic at hand. This is the specific goal of Data Selection (DS), i.e., to select the best subset of an available pool of sentences. The current paper tackles DS by taking advantage of vector space representations of sentences, building on the most recent work on distributed representations [3, 4], with the ultimate goal of obtaining corpus subsets that reduce the computational costs entailed, while improving translation quality. The main contributions of this paper are:

  • We present a bilingual DS strategy leveraging a continuous space representation of sentences and words. Section 3 is devoted to this task.

  • We present empirical results comparing our strategy with a state-of-the-art DS method based on cross-entropy differences. Sections 5.2 and 5.3 are devoted to this task.

This paper is structured as follows. Section 2 summarises the related work. Section 3 presents our DS method using continuous vector-space representations. Section 4 describes the DS methods we use for comparison. In Sect. 5 the experimental design and results are detailed. Finally, conclusions and future work are discussed in Sect. 6.

2 Related Work

We will refer to the available pool of generic-domain sentences as the out-of-domain corpus, because we assume that it belongs to a different domain than the one to be translated. Similarly, we refer to the corpus belonging to the specific domain of the text to be translated as the in-domain corpus.

The simplest instance of DS can be found in language modelling, where perplexity-based selection methods have been used [5]. Here, out-of-domain sentences are ranked by their perplexity score. Another perplexity-based approach is presented in [6], where the cross-entropy difference is used as a ranking function rather than plain perplexity, in order to account for normalization. We use this criterion as a comparison for our DS technique. Other works use perplexity-related DS strategies [7, 8]. In these papers, the authors report good results when using the strategy presented in [6], and that strategy has become a de-facto standard in the SMT research community. In [7] the authors describe a new bilingual method based on the original proposal of [6], which will be explained in detail in Sect. 4.

Other works have applied information retrieval methods for DS [9], in order to produce different sub-models which are then weighted. The baseline was defined as the result obtained by training only with the corpus that shares the same domain with the test. They claim that they are able to improve the baseline translation quality with their method. However, they do not provide a comparison with a model trained on all the corpora available. More recently, [10] leveraged neural language models to perform DS, reporting substantial gains over conventional n-gram language model-based DS.

3 Data Selection Approaches

Here we describe our bilingual DS method for SMT, which uses a continuous vector-space representation (CVR) of sentences or documents in order to exploit each side of the corpus (source and target). This technique extends previous work describing only monolingual DS with CVRs. Our DS approach requires:

  1. A CVR of words (Sect. 3.1) and a CVR of sentences or documents (Sect. 3.2).

  2. A selection algorithm based on these CVRs (Sect. 3.3), and its bilingual extension (Sect. 3.4).

3.1 Continuous Vector-Space Representation of Words

CVRs of words have been widely used in natural language processing, and have recently demonstrated promising results across a variety of tasks [11–13], such as speech recognition, part-of-speech tagging, sentiment classification and identification, and machine translation.

The idea of representing words in vector space was originally proposed in [14]. The limitation of those proposals was that the computational requirements quickly became impractical for growing vocabulary sizes |V|. However, recent work [3, 15] made it possible to overcome this drawback, while still representing words as high-dimensional real-valued vectors: each word \(w_i\) in the vocabulary V, \(w_i \in V\), is represented as a real-valued vector of some fixed dimension D, i.e., \(f(w_i) \in R^{D}\;\forall i = 1,\ldots , |V| \), capturing the (lexical, semantic and syntactic) similarity between words.

Two approaches are proposed in [3], namely, the Continuous Bag of Words Model (CBOW) and the Continuous Skip-Gram Model. CBOW forces the neural net to predict the current word by means of the surrounding words, while Skip-Gram forces the neural net to predict the surrounding words using the current word. These two approaches were compared to previously existing approaches, such as the ones proposed in [16], obtaining considerably better performance in terms of training time. In addition, experimental results also demonstrated that the Skip-Gram model offers better performance on average, excelling especially at the semantic level [3]. These results were confirmed in our own preliminary work, and hence we used the Skip-Gram approach to generate our distributed representations of words.

We used the Word2vec toolkit to obtain the word representations. Word2vec takes a text corpus as input and produces the word vectors as output: it first constructs a vocabulary V from the training corpus and then learns the CVR of the words.
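As an illustration, the vectors that Word2vec writes in its plain-text output format (a header line "vocab_size dim", then one "word v1 ... vD" line per word) can be read back with a few lines of standard Python. This is a minimal sketch; the toy two-word vocabulary below is ours, not from the experiments:

```python
# Minimal sketch: parsing word vectors in the word2vec text format.
import io

def load_word_vectors(fh):
    """Read word2vec text-format vectors into a dict word -> list[float]."""
    vocab_size, dim = map(int, fh.readline().split())
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        assert len(values) == dim      # every row must match the declared dimension
        vectors[word] = values
    assert len(vectors) == vocab_size  # and the declared vocabulary size
    return vectors

# Toy file standing in for Word2vec's output (illustrative values only).
toy_file = io.StringIO("2 3\nmedicine 0.1 0.2 0.3\ndose 0.0 0.5 0.1\n")
vecs = load_word_vectors(toy_file)
print(vecs["dose"])  # [0.0, 0.5, 0.1]
```

In practice the vectors are trained on the full corpus with the Skip-Gram architecture; this snippet only shows how the resulting table of word vectors is consumed downstream.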

A problem that arises when using CVRs of words is how to represent a whole sentence (or document) with a continuous vector. Building on the property described above (i.e., semantically close words are also close in their CVR), the next section presents the different sentence representations employed in the present work.

3.2 Continuous Vector-Space Representation of Sentences

Numerous works have attempted to extend the CVR of words to the sentence or phrase level (just to name a few, [4, 17, 18]). In the present work, we used two different CVRs of sentences, denoted here as \(F(\mathbf {x})\) (or, in some cases and to simplify notation, \(F_{\mathbf {x}}\)):

  1. The first and most intuitive approach relies on a weighted arithmetic mean of the CVRs of all the words in the document or sentence (as proposed by [17, 19]):

    $$\begin{aligned} F_{\mathbf {x}}=F(\mathbf {x})=\frac{\sum _{w\in \mathbf {x}} N_{\mathbf {x}}(w)f(w)}{\sum _{w\in \mathbf {x}} N_{\mathbf {x}}(w)} \end{aligned}$$
    (1)

    where w is a word in sentence \(\mathbf {x}\), f(w) is the CVR of w, obtained as described above, and \(N_{\mathbf {x}}(w)\) is the count of w in sentence \(\mathbf {x}\). We will refer to this representation as Mean-vec.

  2. A more sophisticated approach is presented in [4], where the continuous Skip-Gram model [3] is adapted to generate representative vectors of sentences or documents following the same Skip-Gram architecture, yielding a special vector \(F_{\mathbf {x}}\). We will refer to this representation as Document-vec.
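The Mean-vec representation of Eq. (1) is straightforward to compute once the word vectors are available. A minimal sketch follows (the two-dimensional toy vectors are ours, chosen only for illustration; out-of-vocabulary words are skipped, an assumption that is harmless here since the experiments use \(n_c=1\)):

```python
# Sketch of Eq. (1): the count-weighted arithmetic mean of the CVRs of the
# words in a sentence.
from collections import Counter

def mean_vec(sentence, word_vectors):
    counts = Counter(sentence)                       # N_x(w): count of w in x
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    norm = 0
    for w, n in counts.items():
        if w not in word_vectors:                    # skip OOV words (assumption)
            continue
        total = [t + n * v for t, v in zip(total, word_vectors[w])]
        norm += n
    return [t / norm for t in total]                 # divide by sum of counts

wv = {"the": [1.0, 0.0], "dose": [0.0, 1.0]}         # toy word vectors
print(mean_vec(["the", "dose", "the"], wv))          # roughly [0.667, 0.333]
```

Document-vec, in contrast, is trained jointly with the word vectors and cannot be obtained by such a closed-form combination.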

3.3 DS Using Vector Space Representations of Sentences

Since the objective of DS is to increase the informativeness of the in-domain training corpus, it seems important to choose out-of-domain sentences that provide information considered relevant with respect to the in-domain corpus I.

Algorithm 1 shows the procedure. Here, G is the out-of-domain corpus, \(\mathbf {x}\) is an out-of-domain sentence (\(\mathbf {x}\in G\)), \(F_{\mathbf {x}}\) is the CVR of \(\mathbf {x}\), and |G| is the number of sentences in G. Our objective is to select data from G that is the most suitable for translating data belonging to the in-domain corpus I. For this purpose, we define \(F_{\mathbf {S}}\), where S is the concatenation of all sentences in the in-domain corpus I.

Algorithm 1: Monolingual data selection based on the CVR of sentences.

Algorithm 1 relies on the function \(cos(\cdot ,\cdot )\), which implements the cosine similarity between two sentence vectors:

$$\begin{aligned} cos(F_{\mathbf {S}},F_{\mathbf {x}})=\frac{F_{\mathbf {S}}\cdot F_{\mathbf {x}}}{\Vert F_{\mathbf {S}} \Vert \cdot \Vert F_{\mathbf {x}} \Vert } \end{aligned}$$
(2)

Note that any other similarity metric could have been used. Here, the best value for \(cos(\cdot ,\cdot )\) is 1 and the worst is 0; \(\tau \) is a threshold to be established empirically.
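Under our reading of Algorithm 1, the selection step keeps every out-of-domain sentence whose CVR has cosine similarity to \(F_{\mathbf {S}}\) above \(\tau \). A minimal sketch (the threshold and toy vectors are illustrative, not taken from the experiments):

```python
# Sketch of the monolingual selection of Algorithm 1.
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors, as in Eq. (2)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_monolingual(F_S, out_of_domain, tau):
    """out_of_domain: list of (sentence, F_x) pairs; keep x if cos(F_S, F_x) > tau."""
    return [x for x, F_x in out_of_domain if cos_sim(F_S, F_x) > tau]

F_S = [1.0, 1.0]                                   # toy in-domain representation
pool = [("close to domain", [0.9, 1.1]),
        ("far from domain", [1.0, -1.0])]
print(select_monolingual(F_S, pool, tau=0.5))      # ['close to domain']
```

In the actual setting, \(F_{\mathbf {S}}\) and each \(F_{\mathbf {x}}\) would be Mean-vec or Document-vec representations, and \(\tau \) would be tuned on development data.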

3.4 Bilingual DS Using Vector Space Representations of Sentences

In this section, we extend the method presented in Sect. 3.3 to make use of bilingual data. The purpose is to tackle directly the bilingual nature of the DS problem within an SMT setting. By including both sides of the corpus (source and target sentences), Algorithm 1 is modified to obtain Algorithm 2.

Here, \(G_x\) and \(G_y\) are the source and target sides of the out-of-domain corpus, respectively, and \(\mathbf {x_G}\) and \(\mathbf {y_G}\) are out-of-domain sentences (\(\mathbf {x_G} \in G_x; \mathbf {y_G} \in G_y\), respectively). \(F_{\mathbf {x_{G}}}\) is the CVR of \(\mathbf {x_G}\), and \(F_{\mathbf {y_{G}}}\) is the CVR of \(\mathbf {y_G}\). Similarly to \(F_\mathbf {S}\), we define \(F_{\mathbf {S_{x}}}\) as the CVR of \(S_x\), i.e., the CVR of the concatenation of all source in-domain data, and \(F_{\mathbf {S_{y}}}\) as the CVR of \(S_y\), i.e., the CVR of the concatenation of all target in-domain data.

Algorithm 2: Bilingual data selection based on the CVR of sentences.
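One natural way to combine the two sides, which we sketch below as an assumption about how Algorithm 2 operates (the exact combination rule is given in the algorithm figure), is to keep a sentence pair only when both its source and its target side are close enough to the respective in-domain representations:

```python
# Sketch of a bilingual selection rule: a pair (x, y) is kept only if BOTH
# sides exceed the similarity threshold (our assumption for illustration).
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_bilingual(F_Sx, F_Sy, pairs, tau):
    """pairs: list of (x, F_x, y, F_y) tuples; returns selected (x, y) pairs."""
    return [(x, y) for x, F_x, y, F_y in pairs
            if cos_sim(F_Sx, F_x) > tau and cos_sim(F_Sy, F_y) > tau]

F_Sx, F_Sy = [1.0, 0.0], [0.0, 1.0]                  # toy in-domain representations
pool = [("src close", [1.0, 0.1], "tgt close", [0.1, 1.0]),
        ("src close", [1.0, 0.0], "tgt far",   [1.0, 0.0])]
print(select_bilingual(F_Sx, F_Sy, pool, tau=0.5))   # [('src close', 'tgt close')]
```

Note that the second pair is discarded even though its source side is a perfect match, which is exactly the extra filtering power the bilingual extension is meant to provide.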

4 Cross-Entropy Difference Method

As mentioned in Sect. 2, one established DS method consists in scoring the sentences in the out-of-domain corpus by their perplexity. [6] use cross-entropy rather than perplexity, although the two are monotonically related. The perplexity of a given sentence \(\mathbf {x}\) with empirical n-gram distribution p given a language model q is:

$$\begin{aligned} 2^{-\sum _{x}p(x) \log q(x)}=2^{H (p,q)} \end{aligned}$$
(3)

where \(H(p,q)\) is the cross-entropy between p and q. The formulation proposed in [6] is: Let I be an in-domain corpus and G be an out-of-domain corpus. Let \(H_{I} (\mathbf {x})\) be the cross-entropy, according to a language model trained on I, of a sentence \(\mathbf {x}\) drawn from G. Let \(H_{G}(\mathbf {x})\) be the cross-entropy of \(\mathbf {x}\) according to a language model trained on G. The cross-entropy score of \(\mathbf {x}\) is then defined as

$$\begin{aligned} c(\mathbf {x}) = H_{I} (\mathbf {x}) - H_{G}(\mathbf {x}) \end{aligned}$$
(4)

Note that this method is defined in terms of I, as defined by the original authors. Even though it would also be feasible to define it in terms of S, such a re-definition lies beyond the scope of this paper, since we use this method for comparison purposes only.
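The scoring of Eq. (4) can be sketched with simple add-one-smoothed unigram language models standing in for the n-gram models of [6] (an assumption made only to keep the example short; the toy corpora are ours):

```python
# Sketch of the cross-entropy difference score of Eq. (4) with unigram LMs.
import math
from collections import Counter

class UnigramLM:
    def __init__(self, corpus_tokens):
        self.counts = Counter(corpus_tokens)
        self.total = len(corpus_tokens)
        self.vocab = len(self.counts) + 1            # +1 slot for unseen words

    def cross_entropy(self, sentence):
        """Per-word cross-entropy (base 2) of the sentence under the model."""
        log_prob = 0.0
        for w in sentence:
            p = (self.counts[w] + 1) / (self.total + self.vocab)  # add-one smoothing
            log_prob += math.log2(p)
        return -log_prob / len(sentence)

def cross_entropy_diff(sentence, lm_in, lm_out):
    return lm_in.cross_entropy(sentence) - lm_out.cross_entropy(sentence)  # Eq. (4)

lm_in = UnigramLM("the dose of the medicine".split())        # toy in-domain LM
lm_out = UnigramLM("the debate in the parliament".split())   # toy out-of-domain LM
s = "the dose".split()
# Lower (more negative) scores mean x looks more in-domain than out-of-domain.
print(cross_entropy_diff(s, lm_in, lm_out))
```

Sentences are then ranked by this score and the lowest-scoring ones are selected, up to the desired amount of data.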

In [7], the authors propose an extension to their cross-entropy method [6] so that it can deal with bilingual information. To this end, they sum the cross-entropy difference over each side of the corpus, both source and target. Let I and G be the in-domain and out-of-domain source corpora, respectively, and let L and J be the corresponding target corpora. Then, the bilingual cross-entropy difference is defined as:

$$\begin{aligned} c(\mathbf {x}) = [H_{I} (\mathbf {x})-H_{G}(\mathbf {x})] +[H_{L} (\mathbf {y}) - H_{J}(\mathbf {y})] \end{aligned}$$
(5)
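Eq. (5) is simply the sum of the source-side and target-side differences; as a sketch (the `ConstLM` stand-in is ours, assumed only to expose a `cross_entropy` method like the models of the previous sketch):

```python
class ConstLM:
    """Stand-in language model returning a fixed cross-entropy (illustration only)."""
    def __init__(self, h):
        self.h = h
    def cross_entropy(self, sentence):
        return self.h

def bilingual_ce_diff(x, y, lm_I, lm_G, lm_L, lm_J):
    # Eq. (5): source-side difference plus target-side difference.
    return (lm_I.cross_entropy(x) - lm_G.cross_entropy(x)) \
         + (lm_L.cross_entropy(y) - lm_J.cross_entropy(y))

score = bilingual_ce_diff("x", "y", ConstLM(1.0), ConstLM(2.0),
                          ConstLM(3.0), ConstLM(2.5))
print(score)  # -0.5
```

As in the monolingual case, sentence pairs with the lowest scores are selected.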

5 Experiments

In this section, we describe the experimental framework employed to assess the performance of our DS method. We then present the comparison with cross-entropy DS.

Table 1. In-domain corpora main figures. (EMEA-Domain) is the in-domain corpus, (Medical-Test) the evaluation data, and (Medical-Mert) the development set. \(\vert S \vert \) stands for number of sentences, \(\vert W \vert \) for number of words, and \(\vert V \vert \) for vocabulary size.

5.1 Experimental Setup

We evaluated empirically the DS method described in Sect. 3. For the out-of-domain corpus, we used the Europarl corpus [20], which is composed of translations of the proceedings of the European Parliament. As in-domain data, we used the EMEA corpus [21], which is available in 22 languages and contains documents from the European Medicines Agency. We conducted experiments with different language pairs (English-French [En-Fr]; French-English [Fr-En]; German-English [De-En]; English-German [En-De]) so as to test the robustness of the results achieved. The main figures of the corpora used are shown in Tables 1 and 2.

Table 2. Out-of-domain corpus main figures (abbreviations explained in Table 1).

All experiments were carried out using the open-source phrase-based SMT toolkit Moses [22]. The decoder features a statistical log-linear model including a phrase-based translation model, a language model, a distortion model, and word and phrase penalties. The log-linear combination weights \(\lambda \) were optimized using MERT (minimum error rate training) [23]. Since MERT requires a random initialisation of \(\lambda \) that often leads to different local optima being reached, every point in each plot of this paper constitutes the average of 10 repetitions, with the purpose of providing robustness to the results. In the tables reporting translation quality, \(95\,\%\) confidence intervals of these repetitions are shown, but they are omitted from the plots for clarity. We compared the selection methods with two baseline systems. The first one was obtained by training the SMT system with EMEA-Domain data. We will refer to this setup as baseline-emea. A second baseline experiment was carried out with the concatenation of the Europarl corpus and EMEA training data (i.e., all the data available). We will refer to this setup as bsln-emea-euro. We also included results for a purely random sentence selection without replacement. In the plots, each point corresponding to random selection represents the average of 5 repetitions.

SMT output was evaluated by means of BLEU (BiLingual Evaluation Understudy) [24], which measures the precision of unigrams, bigrams, trigrams, and 4-grams with respect to the reference translation, with a brevity penalty for overly short output.

Word2vec (Sect. 3.1) has different parameters that need to be adjusted. We conducted experiments with different vector dimensions, i.e., \(D=\lbrace 100, 200, 300, 400, 500\rbrace \). In addition, a given word is required to appear at least \(n_c\) times in the corpus to be considered when computing its vector. We analysed the effect of different values \(n_c=\lbrace 1,3,5,10\rbrace \). Experiments not reported here for space reasons led to the following settings for all further experiments reported: sentence vector size \(v\_s=200\) and \(n_c=1\).

5.2 Comparative with Cross-Entropy Selection

As a first step, we compare our DS method with the cross-entropy method, both in their monolingual versions (Sect. 5.1). Results in Fig. 1 show the effect of adding sentences to the in-domain corpus. We only show cross-entropy results using 2-grams, which gave the best result according to previous work. For our DS method, we tested both CVR methods (Document-vec and Mean-vec).

Fig. 1. Effect of adding sentences on the BLEU score using our monolingual DS method, the original cross-entropy method, and random selection. Horizontal lines represent the baseline-emea and bsln-emea-euro scores.

Several conclusions can be drawn:

  • The DS techniques are able to improve translation quality when compared to the baseline-emea setting, in all language pairs.

  • All DS methods are mostly able to improve over random selection, especially when low amounts of data are added. This is reasonable, since all DS methods, including random, will eventually converge to the same point: adding all the available data. Even though these results are to be expected, previous work (reported in Sect. 2) revealed that beating random selection is very hard.

  • In Fig. 1, the results obtained with our DS method are slightly better (or similar) than the ones obtained with cross-entropy.

5.3 Comparative with Bilingual Cross-Entropy Selection

Results comparing our bilingual DS method with bilingual cross-entropy are shown in Fig. 2. In the case of our DS method, the same approach as in the previous section was used. Several conclusions can be drawn:

Fig. 2. Effect of adding sentences on the BLEU score using our bilingual DS method, the bilingual cross-entropy method, and random selection. Horizontal lines represent the baseline-emea and bsln-emea-euro scores.

  • Our bilingual DS technique provides better results than including the full out-of-domain corpus (bsln-emea-euro) in the language pairs En-Fr, Fr-En, and En-De. Specifically, the improvements obtained are in the range of [0.3–0.9] BLEU points while using less than 19 %–27 % of the out-of-domain corpus. In the De-En pair, our DS strategy does not improve the results over including the full out-of-domain corpus, but results are very similar while using less than \(33\,\%\) of it.

  • The results achieved by our bilingual DS strategy are consistently better than those achieved by the bilingual cross-entropy method.

  • For an equal number of sentences, translation quality is significantly better with the bilingual DS method than with its monolingual form (Fig. 1). Hence, the bilingual DS strategy is able to make good use of the bilingual information, selecting a better subset of the out-of-domain data.

5.4 Summary of the Results

Table 3 shows the best results obtained. As shown, our method yields competitive results for each language pair. Note that the bilingual cross-entropy method tends to select more sentences, while its translation quality tends to be slightly worse than that of our method.

Table 3. Summary of the results obtained. \(\#Sentences \) stands for the number of sentences, given in terms of the in-domain corpus size, with \((+)\) indicating the number of selected sentences.

6 Conclusion and Future Work

In this work, we presented a bilingual data selection method based on CVRs of sentences or documents, which are intended to yield similar representations for semantically close sentences. In addition, we compared our technique with a state-of-the-art technique (cross-entropy difference). An important conclusion stands out: our method is able to yield similar or better translation quality than the state-of-the-art method while reducing the number of selected sentences. In future work, we will carry out new experiments with larger and more diverse data sets and with different languages, for example the Germanic language group. In addition, we intend to combine both sides of the corpus in more sophisticated ways. Finally, we intend to compare our bilingual data selection method with other data selection techniques.