We address the problem of automatically cleaning a translation memory (TM) by identifying problematic translation units (TUs). Here, “problematic TUs” are those whose translations are useless to the user of a computer-assisted translation (CAT) tool. We approach TM cleaning as both a supervised and an unsupervised learning problem. In both cases, we take advantage of Translation Memory open-source purifier, an open-source TM cleaning tool also presented in this paper. The two learning paradigms are evaluated on different benchmarks extracted from MyMemory, the world’s largest public TM. Our results indicate the effectiveness of the supervised approach in the ideal condition in which labelled training data is available, and the viability of the unsupervised solution in challenging situations in which training data is not accessible.

In addition to these components, CAT tools can be equipped with concordancers, terminology databases, spell/grammar checkers, indexers and project management functions.
The translation in (d) contains “somministARzione” instead of “somministRAzione”.
It is worth remarking that not all existing TMs are private resources carefully constructed by expert human translators. Some of them are collaboratively built by anonymous contributors and can also include TUs automatically extracted from the Web. In such cases, major translation errors are quite frequent.
Although it is not usable in practice, the simplest “majority voting” baseline, which marks all TUs as “good”, achieves even better results on the same highly imbalanced data.
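The effect of class imbalance on such a baseline can be illustrated with a minimal sketch. The label counts below are hypothetical and are not taken from the shared task data, and balanced accuracy is used here only as one example of a measure that is insensitive to class imbalance.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label (the 'majority voting' prediction)."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def balanced_accuracy(gold, predicted):
    """Mean of per-class recall; each class counts equally regardless of size."""
    recalls = []
    for label in sorted(set(gold)):
        idx = [i for i, g in enumerate(gold) if g == label]
        recalls.append(sum(predicted[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical, highly imbalanced set: 95 "good" TUs and 5 "bad" TUs.
gold = ["good"] * 95 + ["bad"] * 5
predicted = [majority_baseline(gold)] * len(gold)

print(accuracy(gold, predicted))           # 0.95
print(balanced_accuracy(gold, predicted))  # 0.5
```

The baseline scores high on plain accuracy while never detecting a single “bad” TU, which is why it cannot be used in practice.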
For instance, judging the usefulness of a TU whose target side has missing/extra words is a highly subjective task. It is likely that the perceived severity of these errors will be inversely proportional to sentence length.
The evaluation on the third subtask, which corresponds to a multi-class interpretation of the problem, is not discussed for the sake of conciseness.
In the NLP4TM shared task data, the threshold is set to 2.5.
A detailed summary of the FBK HLT-MT submissions to the 1st translation memory cleaning shared task is available at http://rgcl.wlv.ac.uk/wp-content/uploads/2016/05/fbkhltmt-workingnote.
As stated in Sect. 1, we call this method “unsupervised” to convey the idea that, although it relies on supervised learning algorithms, it bypasses the need for the supervision provided by manual labels.
Indeed, each model is obtained with different features (e.g. the A group) and by learning from training data having different label distributions (e.g. inferred using B and C).
All the improvements over the MT-based system are statistically significant (\(p < 0.05\), measured by approximate randomization), while only the result for \(Z = 50\)K and \(k = 15\)K is statistically significantly better than the Barbu15 classifier.
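For readers unfamiliar with approximate randomization, the following is a minimal sketch of a paired, two-sided version of the test. It assumes per-item scores (for example, 1 for a correctly classified TU and 0 otherwise) and uses the difference in mean score as the test statistic. The function name, the number of trials, and the example data are illustrative assumptions, not the exact setup used in the experiments.

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the difference in mean score
    between two systems evaluated on the same items."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        total_a = total_b = 0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the two outputs are exchangeable,
            # so swap them with probability 0.5 for each item.
            if rng.random() < 0.5:
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)

# Hypothetical per-TU correctness for two classifiers on 100 test items.
system_a = [1] * 90 + [0] * 10
system_b = [1] * 70 + [0] * 30
p_value = approximate_randomization(system_a, system_b)
print(p_value < 0.05)  # True
```

A difference is reported as significant when the estimated p-value falls below the chosen threshold, here 0.05.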
In total, this corresponds to 30% of the whole test set.
Negri, M., Ataman, D., Sabet, M.J. et al. Automatic translation memory cleaning. Machine Translation 31, 93–115 (2017). https://doi.org/10.1007/s10590-017-9191-5