Automatic translation memory cleaning

Negri, Matteo; Ataman, Duygu; Sabet, Masoud Jalili; Turchi, Marco; Federico, Marcello

doi:10.1007/s10590-017-9191-5

Published: 10 February 2017

Volume 31, pages 93–115, (2017)

Machine Translation

Matteo Negri ORCID: orcid.org/0000-0002-8811-4330¹,
Duygu Ataman²,
Masoud Jalili Sabet³,
Marco Turchi¹ &
…
Marcello Federico¹

Abstract

We address the problem of automatically cleaning a translation memory (TM) by identifying problematic translation units (TUs). In this context, we treat as “problematic TUs” those whose translations are useless from the point of view of the user of a computer-assisted translation (CAT) tool. We approach TM cleaning both as a supervised and as an unsupervised learning problem. In both cases, we take advantage of TMOP (Translation Memory Open-source Purifier), an open-source TM cleaning tool also presented in this paper. The two learning paradigms are evaluated on different benchmarks extracted from MyMemory, the world’s largest public TM. Our results indicate the effectiveness of the supervised approach in the ideal condition in which labelled training data is available, and the viability of the unsupervised solution for challenging situations in which training data is not accessible.
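
To make the notion of a “problematic TU” more concrete, the following is a minimal sketch of the kind of surface signals a cleaner might compute for each TU. The feature names, the `max_ratio` threshold and the rule in `looks_problematic` are illustrative assumptions, not the feature set or decision logic of the tool presented in the paper; in a learning setup such values would more plausibly be inputs to a classifier than hard rules.

```python
import re


def tu_features(source: str, target: str) -> dict:
    """Compute a few hypothetical surface features for one translation unit."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    return {
        # Ratio of target length to source length, in whitespace tokens.
        "length_ratio": len(tgt_tokens) / max(len(src_tokens), 1),
        # True when either side contains no tokens at all.
        "empty_side": not src_tokens or not tgt_tokens,
        # True when the two sides do not contain the same numbers.
        "digit_mismatch": sorted(re.findall(r"\d+", source))
        != sorted(re.findall(r"\d+", target)),
    }


def looks_problematic(features: dict, max_ratio: float = 3.0) -> bool:
    """Flag a TU with an illustrative rule (assumed threshold, not from the paper)."""
    ratio = features["length_ratio"]
    return (
        features["empty_side"]
        or features["digit_mismatch"]
        or ratio > max_ratio
        or ratio < 1.0 / max_ratio
    )


if __name__ == "__main__":
    good = tu_features("Take 2 tablets daily", "Assumere 2 compresse al giorno")
    bad = tu_features("Take 2 tablets daily", "")
    print(looks_problematic(good), looks_problematic(bad))
```

A rule of this kind is only a crude filter: it catches empty or badly mismatched sides, but it cannot tell a fluent mistranslation from a correct one.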

Notes

In addition to these components, CAT tools can be equipped with concordancers, terminology databases, spell/grammar checkers, indexers and project management functions.
http://rgcl.wlv.ac.uk/nlp4tm2016/shared-task/.
The translation in (d) contains “somministARzione” instead of “somministRAzione”.
It is worth remarking that not all existing TMs are private resources carefully constructed by expert human translators. Some of them are collaboratively built by anonymous contributors and can also include TUs automatically extracted from the Web. In such cases, major translation errors are quite frequent.
http://www.xbench.net/.
https://mymemory.translated.net/.
https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory.
Although it is not usable in practice, the simplest “majority voting” baseline, which marks all the TUs as “good”, achieves even better results on the same highly imbalanced data.
http://alt.qcri.org/semeval2016/task1/.
https://github.com/hlt-mt/TMOP.
For instance, judging the usefulness of a TU whose target side has missing/extra words is a highly subjective task. It is likely that the perceived severity of these errors will be inversely proportional to sentence length.
The evaluation on the third subtask, which corresponds to a multi-class interpretation of the problem, is not discussed for the sake of conciseness.
In the NLP4TM shared task data, the threshold is set to 2.5.
A detailed summary of the FBK HLT-MT submissions to the 1st translation memory cleaning shared task is available at http://rgcl.wlv.ac.uk/wp-content/uploads/2016/05/fbkhltmt-workingnote.
As stated in Sect. 1, we call this method “unsupervised” to convey the idea that, although it relies on supervised learning algorithms, it bypasses the need for the supervision provided by manual labels.
Indeed, each model is obtained with different features (e.g. the A group) and by learning from training data having different label distributions (e.g. inferred using B and C).
https://www.bing.com/translator/.
All the improvements over the MT-based system are statistically significant ($p < 0.05$, measured by approximate randomization), while only the result for $Z = 50$K and $k = 15$K is statistically significantly better than the Barbu15 classifier.
In total, this corresponds to 30% of the whole test set.
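
The significance note above relies on approximate randomization. As a rough illustration of how such a test works, here is a minimal sketch for two classifiers evaluated on the same test items. The function name, the use of accuracy as the test statistic, the number of trials and the smoothed p-value estimate are assumptions made for the example; the page does not say which statistic or settings the authors used.

```python
import random


def approximate_randomization(gold, pred_a, pred_b, trials=10000, seed=0):
    """Two-sided approximate randomization test on the accuracy difference.

    Returns an estimated p-value for the null hypothesis that the two
    systems are interchangeable on this test set.
    """
    rng = random.Random(seed)

    def accuracy(pred):
        return sum(p == g for p, g in zip(pred, gold)) / len(gold)

    observed = abs(accuracy(pred_a) - accuracy(pred_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        # Swap the two systems' outputs on each item with probability 0.5.
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(accuracy(shuffled_a) - accuracy(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


if __name__ == "__main__":
    gold = [1] * 40
    system_a = [1] * 40   # always right
    system_b = [0] * 40   # always wrong
    print(approximate_randomization(gold, system_a, system_b, trials=2000))
```

The difference counts as significant at the 0.05 level when the returned value falls below 0.05. Two systems with identical outputs yield a value of 1.0, since every shuffle reproduces the observed difference of zero.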

References

Abdul Rauf S, Schwenk H (2011) Parallel sentence generation from comparable corpora for improved SMT. Mach Transl 25(4):341–375
Arthern P (1979) Machine translation and computerized terminology systems: a translator’s viewpoint. In: Translating and the computer, proceedings of a seminar, London, UK, pp 77–108
Barbu E (2015) Spotting false translation segments in translation memories. In: Proceedings of the workshop on natural language processing for translation memories, Hissar, Bulgaria, pp 9–16
Barbu E, Parra Escartín C, Bentivogli L, Negri M, Turchi M, Federico M, Mastrostefano L, Orasan C (2016) 1st shared task on automatic translation memory cleaning. In: Proceedings of the 2nd Workshop on natural language processing for translation memories (NLP4TM 2016). Portorož, Slovenia, pp 1–5
Biçici E, Dymetman M (2008) Dynamic translation memory: using statistical machine translation to improve translation memory fuzzy matches. In: Proceedings of the 9th international conference on computational linguistics and intelligent text processing, CICLing’08, Haifa, Israel, pp 454–465
Bloodgood M, Strauss B (2014) Translation memory retrieval methods. In: Proceedings of the 14th conference of the European chapter of the association for computational linguistics, Gothenburg, Sweden, pp 202–210
Brown PF, Della Pietra SA, Della Pietra VJ, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19(2):263–311
Burchardt A, Lommel A (2014) Practical guidelines for the use of MQM in scientific research on translation quality. Technical report, DFKI, Berlin, Germany
Camargo de Souza JG, Buck C, Turchi M, Negri M (2013) FBK-UEdin participation to the WMT13 quality estimation shared task. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 352–358
Chatzitheodorou K (2015) Improving translation memory fuzzy matching by paraphrasing. In: Proceedings of the workshop on natural language processing for translation memories, Hissar, Bulgaria, pp 24–30
Chu C, Nakazawa T, Kurohashi S (2013) Chinese–Japanese parallel sentence extraction from quasi–comparable corpora. In: Proceedings of the sixth workshop on building and using comparable corpora, Sofia, Bulgaria, pp 34–42
Cotterell R, Schütze H, Eisner J (2016) Morphological smoothing and extrapolation of word embeddings. In: Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), Berlin, Germany, pp 1651–1660
Denkowski M, Hanneman G, Lavie A (2012) The CMU-avenue French–English translation system. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 261–266
Dyer C, Clark J, Lavie A, Smith NA (2011) Unsupervised word alignment with arbitrary features. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies-volume 1, Portland, Oregon, USA, pp 409–419
Eetemadi S, Lewis W, Toutanova K, Radha H (2015) Survey of data-selection methods in statistical machine translation. Mach Transl 29(3–4):189–223
Gao Q, Vogel S (2008) Parallel implementations of word alignment tool. In: Proceedings of the ACL 2008 software engineering, testing, and quality assurance workshop, Columbus, Ohio, USA, pp 49–57
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42
Gupta R, Bechara H, Orasan C (2014) Intelligent translation memory matching and retrieval metric exploiting linguistic technology. In: Proceedings of translating and the computer 36, London, UK, pp 86–89
Gupta R, Orasan C, Zampieri M, Vela M, Van Genabith J (2015) Can translation memories afford not to use paraphrasing? In: Proceedings of the 18th annual conference of the European association for machine translation, Antalya, Turkey, pp 35–42
Khadivi S, Ney H (2005) Automatic filtering of bilingual corpora for statistical machine translation. In: Proceedings of natural language processing and information systems, 10th international conference on applications of natural language to information systems, Alicante, Spain, pp 263–274
Koehn P, Senellart J (2010) Convergence of translation memory and statistical machine translation. In: Proceedings of AMTA workshop on MT research and the translation industry, Denver, CO, USA, pp 21–31
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710
Lommel A (2015) Multidimensional quality metrics (MQM) definition. Technical report, DFKI, Berlin, Germany
Lui M, Baldwin T (2012) langid.py: an off-the-shelf language identification tool. In: Proceedings of the ACL 2012 system demonstrations, Jeju Island, Korea, pp 25–30
Ma Y, He Y, Way A, Van Genabith J (2011) Consistent translation using discriminative learning: a translation memory-inspired approach. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, volume 1, Portland, Oregon, USA, pp 1239–1248
Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc B Stat Methodol 72(4):417–473
Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Comput Linguist 31(4):477–504
Nakazawa T, Kurohashi S (2011) Bayesian subtree alignment model based on dependency trees. In: Proceedings of 5th international joint conference on natural language processing, Chiang Mai, Thailand, pp 794–802
Negri M, Marchetti A, Mehdad Y, Bentivogli L, Giampiccolo D (2012) Semeval-2012 task 8: cross-lingual textual entailment for content synchronization. In: Proceedings of the 6th international workshop on semantic evaluation (SemEval 2012), Montréal, Canada, pp 399–407
Noreen EW (1989) Computer-intensive methods for testing hypotheses: an introduction. Wiley, New York
Rarrick S, Quirk C, Lewis W (2011) MT detection in web-scraped parallel corpora. In: MT summit XIII: the thirteenth machine translation summit, Xiamen, China, pp 422–429
Riesa J, Marcu D (2012) Automatic parallel fragment extraction from noisy data. In: Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: human language technologies, Montréal, Canada, pp 538–542
Sikes R (2007) Fuzzy matching in theory and practice. Multilingual 18(6):39–43
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA 2006: proceedings of the 7th conference of the association for machine translation in the Americas, visions for the future of machine translation, Cambridge, Massachusetts, USA, pp 223–231
Søgaard A, Agić Ž, Martínez Alonso H, Plank B, Bohnet B, Johannsen A (2015) Inverted indexing for cross-lingual NLP. In: Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: long papers), Beijing, China, pp 1713–1722
Specia L, Cancedda N, Dymetman M, Turchi M, Cristianini N (2009) Estimating the sentence-level quality of machine translation systems. In: Proceedings of the 13th annual conference of the European association for machine translation (EAMT-2009), Barcelona, Spain, pp 28–35
Tillmann C (2009) A beam-search extraction algorithm for comparable data. In: Proceedings of the ACL-IJCNLP 2009 conference short papers, Singapore, pp 225–228
Turchi M, Negri M, Federico M (2013) Coping with the subjectivity of human judgements in MT quality estimation. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 240–251
Turchi M, Negri M, Federico M (2014) Data-driven annotation of binary MT quality estimation corpora based on human post-editions. Mach Transl 28(3):281–308
Vanallemeersch T, Vandeghinste V (2014) Improving fuzzy matching through syntactic knowledge. In: Proceedings of translating and the computer 36, London, pp 217–227
Vanallemeersch T, Vandeghinste V (2015) Assessing linguistically aware fuzzy matching in translation memories. In: Proceedings of the 18th annual conference of the European association for machine translation, Antalya, Turkey, pp 153–160
Wang K, Zong C, Su KY (2013) Integrating translation memory into phrase-based machine translation during decoding. In: Proceedings of the 51st annual meeting of the association for computational linguistics (volume 1: long papers), Sofia, Bulgaria, pp 11–21
Yeh A (2000) More accurate tests for the statistical significance of result differences. In: The 18th international conference on computational linguistics, COLING 2000 in Europe, proceedings of the conference, volume 2, Saarbrücken, Germany, pp 947–953
Zhechev V, Van Genabith J (2010) Seeding statistical machine translation with translation memory output through tree-based structural alignment. In: Proceedings of the 4th workshop on syntax and structure in statistical translation, Beijing, China, pp 43–51

Acknowledgements

This work has been partially supported by the EC-funded project ModernMT (H2020 Grant Agreement No. 645487). The work carried out at FBK by Masoud Jalili Sabet was sponsored by the European Association for Machine Translation, through the EAMT summer internships 2015 program. The authors would also like to thank Translated for providing a dump of MyMemory, Eduard Barbu for running his system on our data, and the anonymous reviewers for their insightful comments.

Author information

Authors and Affiliations

Fondazione Bruno Kessler, Povo, Trento, Italy
Matteo Negri, Marco Turchi & Marcello Federico
Fondazione Bruno Kessler, Università degli Studi di Trento, Povo, Trento, Italy
Duygu Ataman
School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
Masoud Jalili Sabet

Corresponding author

Correspondence to Matteo Negri.

About this article

Cite this article

Negri, M., Ataman, D., Sabet, M.J. et al. Automatic translation memory cleaning. Machine Translation 31, 93–115 (2017). https://doi.org/10.1007/s10590-017-9191-5

Received: 07 June 2016
Accepted: 24 January 2017
Published: 10 February 2017
Issue Date: September 2017
DOI: https://doi.org/10.1007/s10590-017-9191-5
