1 Introduction

Entities such as the United Nations and other international organizations need to translate all the documentation they generate into different languages, producing very large multilingual corpora which are ideal for training Statistical Machine Translation (SMT) [1] systems. However, such large corpora are difficult to process, increasing the computational requirements needed to train SMT systems robustly. For example, the corpora used in recent SMT evaluations are on the order of 1 billion running words [2].

The main problems that arise when using such a huge pool of sentences are:

  • The use of all corpora for training increases the computational requirements.

  • Topic mismatch between training and test conditions impacts translation quality.

Nevertheless, standard practice is to train SMT systems with all the available data, following the premise of “the more data, the better”. This assumption is usually correct if all the data belongs to the same domain, but most commercial SMT systems will not have such luck: most SMT systems are designed to translate very specific text, such as user manuals or medical leaflets. Hence, the challenge is to wisely select the subset of bilingual sentences that performs best for the topic at hand. This is the specific goal of Data Selection (DS), i.e., to select the best subset of an available pool of sentences. The current paper tackles DS by taking advantage of vector space representations of sentences, building on the most recent work on distributed representations [3, 4], with the ultimate goal of obtaining corpus subsets that reduce the computational costs entailed, while improving translation quality. The main contributions of this paper are:

  • We present a bilingual DS strategy leveraging a continuous space representation of sentences and words. Section 3 is devoted to this task.

  • We present empirical results comparing our strategy with a state-of-the-art DS method based on cross-entropy differences. Sections 5.2 and 5.3 are devoted to this task.

This paper is structured as follows. Section 2 summarises the related work. Section 3 presents our DS method using continuous vector-space representations. Section 4 describes the DS methods we use for comparison. In Sect. 5 the experimental design and results are detailed. Finally, conclusions and future work are discussed in Sect. 6.

2 Related Work

We will refer to the available pool of generic-domain sentences as the out-of-domain corpus, because we assume that it belongs to a different domain than the one to be translated. Similarly, we refer to the corpus belonging to the specific domain of the text to be translated as the in-domain corpus.

The simplest instance of DS can be found in language modelling, where perplexity-based selection methods have been used [5]. Here, out-of-domain sentences are ranked by their perplexity score. Another perplexity-based approach is presented in [6], where the cross-entropy difference is used as a ranking function rather than plain perplexity, in order to account for normalization. We use this criterion as a comparison for our DS technique. Other works use perplexity-related DS strategies [7, 8]. In these papers, the authors report good results when using the strategy presented in [6], and that strategy has become a de-facto standard in the SMT research community. In [7] the authors describe a new bilingual method based on the original proposal of [6], which will be explained in detail in Sect. 4.

Other works have applied information retrieval methods for DS [9], in order to produce different sub-models which are then weighted. The baseline was defined as the result obtained by training only with the corpus that shares the same domain with the test. They claim that they are able to improve the baseline translation quality with their method. However, they do not provide a comparison with a model trained on all the corpora available. More recently, [10] leveraged neural language models to perform DS, reporting substantial gains over conventional n-gram language model-based DS.

3 Data Selection Approaches

Here we describe our bilingual DS method for SMT, which uses a continuous vector-space representation (CVR) of sentences or documents in order to exploit each side of the corpus (source and target). This technique extends previous work describing only monolingual DS with CVRs. Our DS approach requires:

  1. A CVR of words (Sect. 3.1) and a CVR of sentences or documents (Sect. 3.2).

  2. A selection algorithm based on these CVRs (Sect. 3.3), and its bilingual extension (Sect. 3.4).

3.1 Continuous Vector-Space Representation of Words

CVRs of words have been widely used in natural language processing, and have recently demonstrated promising results across a variety of tasks [11–13], such as speech recognition, part-of-speech tagging, sentiment classification and identification, and machine translation.

The idea of representing words in vector space was originally proposed in [14]. The limitation of those proposals was that the computational requirements quickly became impractical for growing vocabulary sizes |V|. However, recent work [3, 15] made it possible to overcome this drawback, while still representing words as high-dimensional real-valued vectors: each word \(w_i\) in the vocabulary V, \(w_i \in V\), is represented as a real-valued vector of some fixed dimension D, i.e., \(f(w_i) \in R^{D}\;\forall i = 1,\ldots , |V| \), capturing the (lexical, semantic and syntactic) similarity between words.

Two approaches are proposed in [3], namely, the Continuous Bag of Words Model (CBOW) and the Continuous Skip-Gram Model. CBOW forces the neural net to predict the current word by means of the surrounding words, while Skip-Gram forces the neural net to predict the surrounding words using the current word. These two approaches were compared to previously existing approaches, such as the ones proposed in [16], obtaining considerably better performance in terms of training time. In addition, experimental results also demonstrated that the Skip-Gram model offers better performance on average, excelling especially at the semantic level [3]. These results were confirmed in our own preliminary work, and hence we used the Skip-Gram approach to generate our distributed representations of words.

We used the Word2vec toolkit to obtain the word representations. Word2vec takes a text corpus as input and produces the word vectors as output: it first constructs a vocabulary V from the training corpus and then learns the CVR of the words.
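As an illustration, the vectors that Word2vec writes in its plain-text output format (a header line "vocab_size dim", then one "word v1 ... vD" line per word) can be read back with a few lines of standard Python. This is a minimal sketch; the toy two-word vocabulary below is ours, not from the experiments:

```python
# Minimal sketch: parsing word vectors in the word2vec text format.
import io

def load_word_vectors(fh):
    """Read word2vec text-format vectors into a dict word -> list[float]."""
    vocab_size, dim = map(int, fh.readline().split())
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        assert len(values) == dim      # every row must match the declared dimension
        vectors[word] = values
    assert len(vectors) == vocab_size  # and the declared vocabulary size
    return vectors

# Toy file standing in for Word2vec's output (illustrative values only).
toy_file = io.StringIO("2 3\nmedicine 0.1 0.2 0.3\ndose 0.0 0.5 0.1\n")
vecs = load_word_vectors(toy_file)
print(vecs["dose"])  # [0.0, 0.5, 0.1]
```

In practice the vectors are trained on the full corpus with the Skip-Gram architecture; this snippet only shows how the resulting table of word vectors is consumed downstream.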

A problem that arises when using CVRs of words is how to represent a whole sentence (or document) with a continuous vector. Building on the property described above (i.e., semantically close words are also close in their CVR), the next section presents the different sentence representations employed in the present work.

3.2 Continuous Vector-Space Representation of Sentences

Numerous works have attempted to extend the CVR of words to the sentence or phrase level (just to name a few, [4, 17, 18]). In the present work, we used two different CVRs of sentences, denoted here as \(F(\mathbf {x})\) (or, in some cases and to simplify notation, \(F_{\mathbf {x}}\)):

  1. The first and most intuitive approach relies on a weighted arithmetic mean of the CVRs of all the words in the document or sentence (as proposed by [17, 19]):

    $$\begin{aligned} F_{\mathbf {x}}=F(\mathbf {x})=\frac{\sum _{w\in \mathbf {x}} N_{\mathbf {x}}(w)f(w)}{\sum _{w\in \mathbf {x}} N_{\mathbf {x}}(w)} \end{aligned}$$
    (1)

    where w is a word in sentence \(\mathbf {x}\), f(w) is the CVR of w, obtained as described above, and \(N_{\mathbf {x}}(w)\) is the count of w in sentence \(\mathbf {x}\). We will refer to this representation as Mean-vec.

  2. A more sophisticated approach is presented in [4], where the continuous Skip-Gram model [3] is adapted to generate representative vectors of sentences or documents following the same Skip-Gram architecture, yielding a special vector \(F_{\mathbf {x}}\). We will refer to this representation as Document-vec.
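The Mean-vec representation of Eq. (1) is straightforward to compute once the word vectors are available. A minimal sketch follows (the two-dimensional toy vectors are ours, chosen only for illustration; out-of-vocabulary words are skipped, an assumption that is harmless here since the experiments use \(n_c=1\)):

```python
# Sketch of Eq. (1): the count-weighted arithmetic mean of the CVRs of the
# words in a sentence.
from collections import Counter

def mean_vec(sentence, word_vectors):
    counts = Counter(sentence)                       # N_x(w): count of w in x
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    norm = 0
    for w, n in counts.items():
        if w not in word_vectors:                    # skip OOV words (assumption)
            continue
        total = [t + n * v for t, v in zip(total, word_vectors[w])]
        norm += n
    return [t / norm for t in total]                 # divide by sum of counts

wv = {"the": [1.0, 0.0], "dose": [0.0, 1.0]}         # toy word vectors
print(mean_vec(["the", "dose", "the"], wv))          # roughly [0.667, 0.333]
```

Document-vec, in contrast, is trained jointly with the word vectors and cannot be obtained by such a closed-form combination.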

3.3 DS Using Vector Space Representations of Sentences

Since the objective of DS is to increase the informativeness of the in-domain training corpus, it seems important to choose out-of-domain sentences that provide information considered relevant with respect to the in-domain corpus I.

Algorithm 1 shows the procedure. Here, G is the out-of-domain corpus, \(\mathbf {x}\) is an out-of-domain sentence (\(\mathbf {x}\in G\)), \(F_{\mathbf {x}}\) is the CVR of \(\mathbf {x}\), and |G| is the number of sentences in G. Our objective is to select data from G that is the most suitable for translating data belonging to the in-domain corpus I. For this purpose, we define \(F_{\mathbf {S}}\), where S is the concatenation of all sentences in the in-domain corpus I.

Algorithm 1: Monolingual data selection based on the CVR of sentences.

Algorithm 1 relies on the function \(cos(\cdot ,\cdot )\), which implements the cosine similarity between two sentence vectors:

$$\begin{aligned} cos(F_{\mathbf {S}},F_{\mathbf {x}})=\frac{F_{\mathbf {S}}\cdot F_{\mathbf {x}}}{\Vert F_{\mathbf {S}} \Vert \cdot \Vert F_{\mathbf {x}} \Vert } \end{aligned}$$
(2)

Note that any other similarity metric could have been used. Here, the best value for \(cos(\cdot ,\cdot )\) is 1 and the worst is 0; \(\tau \) is a threshold to be established empirically.
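Under our reading of Algorithm 1, the selection step keeps every out-of-domain sentence whose CVR has cosine similarity to \(F_{\mathbf {S}}\) above \(\tau \). A minimal sketch (the threshold and toy vectors are illustrative, not taken from the experiments):

```python
# Sketch of the monolingual selection of Algorithm 1.
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors, as in Eq. (2)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_monolingual(F_S, out_of_domain, tau):
    """out_of_domain: list of (sentence, F_x) pairs; keep x if cos(F_S, F_x) > tau."""
    return [x for x, F_x in out_of_domain if cos_sim(F_S, F_x) > tau]

F_S = [1.0, 1.0]                                   # toy in-domain representation
pool = [("close to domain", [0.9, 1.1]),
        ("far from domain", [1.0, -1.0])]
print(select_monolingual(F_S, pool, tau=0.5))      # ['close to domain']
```

In the actual setting, \(F_{\mathbf {S}}\) and each \(F_{\mathbf {x}}\) would be Mean-vec or Document-vec representations, and \(\tau \) would be tuned on development data.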

3.4 Bilingual DS Using Vector Space Representations of Sentences

In this section, we extend the method presented in Sect. 3.3 to make use of bilingual data. The purpose is to tackle directly the bilingual nature of the DS problem within an SMT setting. By including both sides of the corpus (source and target sentences), Algorithm 1 is modified to obtain Algorithm 2.

Here, \(G_x\) and \(G_y\) are the source and target sides of the out-of-domain corpus, respectively, and \(\mathbf {x_G}\) and \(\mathbf {y_G}\) are out-of-domain sentences (\(\mathbf {x_G} \in G_x; \mathbf {y_G} \in G_y\), respectively). \(F_{\mathbf {x_{G}}}\) is the CVR of \(\mathbf {x_G}\), and \(F_{\mathbf {y_{G}}}\) is the CVR of \(\mathbf {y_G}\). Similarly to \(F_\mathbf {S}\), we define \(F_{\mathbf {S_{x}}}\) as the CVR of \(S_x\), i.e., the CVR of the concatenation of all source in-domain data, and \(F_{\mathbf {S_{y}}}\) as the CVR of \(S_y\), i.e., the CVR of the concatenation of all target in-domain data.

Algorithm 2: Bilingual data selection based on the CVR of sentences.
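One natural way to combine the two sides, which we sketch below as an assumption about how Algorithm 2 operates (the exact combination rule is given in the algorithm figure), is to keep a sentence pair only when both its source and its target side are close enough to the respective in-domain representations:

```python
# Sketch of a bilingual selection rule: a pair (x, y) is kept only if BOTH
# sides exceed the similarity threshold (our assumption for illustration).
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_bilingual(F_Sx, F_Sy, pairs, tau):
    """pairs: list of (x, F_x, y, F_y) tuples; returns selected (x, y) pairs."""
    return [(x, y) for x, F_x, y, F_y in pairs
            if cos_sim(F_Sx, F_x) > tau and cos_sim(F_Sy, F_y) > tau]

F_Sx, F_Sy = [1.0, 0.0], [0.0, 1.0]                  # toy in-domain representations
pool = [("src close", [1.0, 0.1], "tgt close", [0.1, 1.0]),
        ("src close", [1.0, 0.0], "tgt far",   [1.0, 0.0])]
print(select_bilingual(F_Sx, F_Sy, pool, tau=0.5))   # [('src close', 'tgt close')]
```

Note that the second pair is discarded even though its source side is a perfect match, which is exactly the extra filtering power the bilingual extension is meant to provide.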

4 Cross-Entropy Difference Method

As mentioned in Sect. 2, one established DS method consists in scoring the sentences in the out-of-domain corpus by their perplexity. [6] use cross-entropy rather than perplexity, although the two are monotonically related. The perplexity of a given sentence \(\mathbf {x}\) with empirical n-gram distribution p given a language model q is:

$$\begin{aligned} 2^{-\sum _{x}p(x) \log q(x)}=2^{H (p,q)} \end{aligned}$$
(3)

where \(H(p,q)\) is the cross-entropy between p and q. The formulation proposed in [6] is: Let I be an in-domain corpus and G be an out-of-domain corpus. Let \(H_{I} (\mathbf {x})\) be the cross-entropy, according to a language model trained on I, of a sentence \(\mathbf {x}\) drawn from G. Let \(H_{G}(\mathbf {x})\) be the cross-entropy of \(\mathbf {x}\) according to a language model trained on G. The cross-entropy score of \(\mathbf {x}\) is then defined as

$$\begin{aligned} c(\mathbf {x}) = H_{I} (\mathbf {x}) - H_{G}(\mathbf {x}) \end{aligned}$$
(4)

Note that this method is defined in terms of I, as defined by the original authors. Even though it would also be feasible to define it in terms of S, such a re-definition lies beyond the scope of this paper, since we use this method for comparison purposes only.
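The scoring of Eq. (4) can be sketched with simple add-one-smoothed unigram language models standing in for the n-gram models of [6] (an assumption made only to keep the example short; the toy corpora are ours):

```python
# Sketch of the cross-entropy difference score of Eq. (4) with unigram LMs.
import math
from collections import Counter

class UnigramLM:
    def __init__(self, corpus_tokens):
        self.counts = Counter(corpus_tokens)
        self.total = len(corpus_tokens)
        self.vocab = len(self.counts) + 1            # +1 slot for unseen words

    def cross_entropy(self, sentence):
        """Per-word cross-entropy (base 2) of the sentence under the model."""
        log_prob = 0.0
        for w in sentence:
            p = (self.counts[w] + 1) / (self.total + self.vocab)  # add-one smoothing
            log_prob += math.log2(p)
        return -log_prob / len(sentence)

def cross_entropy_diff(sentence, lm_in, lm_out):
    return lm_in.cross_entropy(sentence) - lm_out.cross_entropy(sentence)  # Eq. (4)

lm_in = UnigramLM("the dose of the medicine".split())        # toy in-domain LM
lm_out = UnigramLM("the debate in the parliament".split())   # toy out-of-domain LM
s = "the dose".split()
# Lower (more negative) scores mean x looks more in-domain than out-of-domain.
print(cross_entropy_diff(s, lm_in, lm_out))
```

Sentences are then ranked by this score and the lowest-scoring ones are selected, up to the desired amount of data.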

In [7], the authors propose an extension to their cross-entropy method [6] so that it can deal with bilingual information. To this end, they sum the cross-entropy difference over each side of the corpus, both source and target. Let I and G be the in-domain and out-of-domain source corpora, respectively, and let L and J be the corresponding target corpora. Then, the bilingual cross-entropy difference is defined as:

$$\begin{aligned} c(\mathbf {x}) = [H_{I} (\mathbf {x})-H_{G}(\mathbf {x})] +[H_{L} (\mathbf {y}) - H_{J}(\mathbf {y})] \end{aligned}$$
(5)
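Eq. (5) is simply the sum of the source-side and target-side differences; as a sketch (the `ConstLM` stand-in is ours, assumed only to expose a `cross_entropy` method like the models of the previous sketch):

```python
class ConstLM:
    """Stand-in language model returning a fixed cross-entropy (illustration only)."""
    def __init__(self, h):
        self.h = h
    def cross_entropy(self, sentence):
        return self.h

def bilingual_ce_diff(x, y, lm_I, lm_G, lm_L, lm_J):
    # Eq. (5): source-side difference plus target-side difference.
    return (lm_I.cross_entropy(x) - lm_G.cross_entropy(x)) \
         + (lm_L.cross_entropy(y) - lm_J.cross_entropy(y))

score = bilingual_ce_diff("x", "y", ConstLM(1.0), ConstLM(2.0),
                          ConstLM(3.0), ConstLM(2.5))
print(score)  # -0.5
```

As in the monolingual case, sentence pairs with the lowest scores are selected.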

5 Experiments

In this section, we describe the experimental framework employed to assess the performance of our DS method. We then present the comparison with cross-entropy DS.

Table 1. In-domain corpora main figures. (EMEA-Domain) is the in-domain corpus, (Medical-Test) the evaluation data, and (Medical-Mert) the development set. \(\vert S \vert \) stands for number of sentences, \(\vert W \vert \) for number of words, and \(\vert V \vert \) for vocabulary size.

5.1 Experimental Setup

We evaluated empirically the DS method described in Sect. 3. For the out-of-domain corpus, we used the Europarl corpus [20], which is composed of translations of the proceedings of the European Parliament. As in-domain data, we used the EMEA corpus [21], which is available in 22 languages and contains documents from the European Medicines Agency. We conducted experiments with different language pairs (English-French [En-Fr]; French-English [Fr-En]; German-English [De-En]; English-German [En-De]) so as to test the robustness of the results achieved. The main figures of the corpora used are shown in Tables 1 and 2.

Table 2. Out-of-domain corpus main figures (abbreviations explained in Table 1).

All experiments were carried out using the open-source phrase-based SMT toolkit Moses [22]. The decoder features a statistical log-linear model including a phrase-based translation model, a language model, a distortion model, and word and phrase penalties. The log-linear combination weights \(\lambda \) were optimized using MERT (minimum error rate training) [23]. Since MERT requires a random initialisation of \(\lambda \) that often leads to different local optima being reached, every point in each plot of this paper constitutes the average of 10 repetitions, with the purpose of providing robustness to the results. In the tables reporting translation quality, \(95\,\%\) confidence intervals of these repetitions are shown, but they are omitted from the plots for clarity. We compared the selection methods with two baseline systems. The first one was obtained by training the SMT system with EMEA-Domain data. We will refer to this setup as baseline-emea. A second baseline experiment was carried out with the concatenation of the Europarl corpus and EMEA training data (i.e., all the data available). We will refer to this setup as bsln-emea-euro. We also included results for a purely random sentence selection without replacement. In the plots, each point corresponding to random selection represents the average of 5 repetitions.

SMT output was evaluated by means of BLEU (BiLingual Evaluation Understudy) [24], which measures the precision of unigrams, bigrams, trigrams, and 4-grams with respect to the reference translation, with a brevity penalty for overly short output.

Word2vec (Sect. 3.1) has different parameters that need to be adjusted. We conducted experiments with different vector dimensions, i.e., \(D=\lbrace 100, 200, 300, 400, 500\rbrace \). In addition, a given word is required to appear at least \(n_c\) times in the corpus to be considered when computing its vector. We analysed the effect of different values \(n_c=\lbrace 1,3,5,10\rbrace \). Experiments not reported here for space reasons led to the following settings for all further experiments reported: sentence vector size \(v\_s=200\) and \(n_c=1\).

5.2 Comparative with Cross-Entropy Selection

As a first step, we compare our DS method with the cross-entropy method, both in their monolingual versions (Sect. 5.1). Results in Fig. 1 show the effect of adding sentences to the in-domain corpus. We only show cross-entropy results using 2-grams, which gave the best result according to previous work. For our DS method, we tested both CVR methods (Document-vec and Mean-vec).

Fig. 1. Effect of adding sentences on the BLEU score using our monolingual DS method, the original cross-entropy method, and random selection. Horizontal lines represent the baseline-emea and bsln-emea-euro scores.

Several conclusions can be drawn:

  • The DS techniques are able to improve translation quality when compared to the baseline-emea setting, in all language pairs.

  • All DS methods are mostly able to improve over random selection, especially when low amounts of data are added. This is reasonable, since all DS methods, including random, will eventually converge to the same point: adding all the available data. Even though these results are to be expected, previous work (reported in Sect. 2) revealed that beating random selection is very hard.

  • In Fig. 1, the results obtained with our DS method are slightly better (or similar) than the ones obtained with cross-entropy.

5.3 Comparative with Bilingual Cross-Entropy Selection

Results comparing our bilingual DS method with bilingual cross-entropy are shown in Fig. 2. In the case of our DS method, the same approach as in the previous section was used. Several conclusions can be drawn:

Fig. 2. Effect of adding sentences on the BLEU score using our bilingual DS method, the bilingual cross-entropy method, and random selection. Horizontal lines represent the baseline-emea and bsln-emea-euro scores.

  • Our bilingual DS technique provides better results than including the full out-of-domain corpus (bsln-emea-euro) in the language pairs En-Fr, Fr-En, and En-De. Specifically, the improvements obtained are in the range of [0.3–0.9] BLEU points while using less than 19 %–27 % of the out-of-domain corpus. In the De-En pair, our DS strategy does not improve the results over including the full out-of-domain corpus, but results are very similar while using less than \(33\,\%\) of it.

  • The results achieved by our bilingual DS strategy are consistently better than those achieved by the bilingual cross-entropy method.

  • For an equal number of sentences, translation quality is significantly better with the bilingual DS method than with its monolingual form (Fig. 1). Hence, the bilingual DS strategy is able to make good use of the bilingual information, selecting a better subset of the out-of-domain data.

5.4 Summary of the Results

Table 3 shows the best results obtained. As shown, our method yields competitive results for each language pair. Note that the bilingual cross-entropy method tends to select more sentences, while its translation quality tends to be slightly worse than that of our method.

Table 3. Summary of the results obtained. \(\#Sentences \) stands for the number of sentences, given in terms of the in-domain corpus size, with \((+)\) indicating the number of selected sentences.

6 Conclusion and Future Work

In this work, we presented a bilingual data selection method based on CVRs of sentences or documents, which are intended to yield similar representations for semantically close sentences. In addition, we compared our technique with a state-of-the-art technique (cross-entropy difference). An important conclusion stands out: our method is able to yield similar or better translation quality than the state-of-the-art method while reducing the number of selected sentences. In future work, we will carry out new experiments with larger and more diverse data sets and with different languages, for example the Germanic language group. In addition, we intend to combine both sides of the corpus in more sophisticated ways. Finally, we intend to compare our bilingual data selection method with other data selection techniques.