Abstract
As machine translation (MT) tools have become mainstream, machine-translated text has increasingly appeared on multilingual websites. Trustworthy multilingual websites are used as training corpora for statistical machine translation tools, so large amounts of MT text in the training data may make such tools less effective. We performed three experiments to determine whether a support vector machine (SVM) could distinguish machine-translated text from human-written text (both original text and human translations). Machine-translated versions of the Canadian Hansard were detected with an F-measure of 0.999. Machine-translated versions of six Government of Canada websites were detected with an F-measure of 0.98. We validated these results with a decision tree classifier. An experiment to find MT text on Government of Ontario websites using Government of Canada training data was unfruitful, producing a high rate of false positives. Machine-translated text appears to be learnable and detectable when the training corpus is similar to the text being classified.
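The results above are reported as F-measures. As a minimal sketch of how such a score could be computed for a two-class MT-versus-human labelling, the Python below assumes the balanced F-measure (beta = 1) with machine-translated text as the positive class. The abstract does not state which variant was used, and the function name, label strings, and toy data are hypothetical, chosen only for illustration.

```python
def f_measure(gold, predicted, positive="MT", beta=1.0):
    """Compute the F-measure for one class from parallel label lists.

    Assumption: beta=1.0 (balanced precision and recall); the abstract
    does not say which F-measure variant the authors report.
    """
    pairs = list(zip(gold, predicted))
    # True positives: MT documents correctly flagged as MT.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    # False positives: human-written documents wrongly flagged as MT.
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    # False negatives: MT documents the classifier missed.
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    if tp == 0:
        return 0.0  # avoids division by zero when nothing was found

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy labels (hypothetical, not data from the paper).
gold = ["MT", "MT", "HUMAN", "HUMAN"]
predicted = ["MT", "HUMAN", "MT", "HUMAN"]
score = f_measure(gold, predicted)
```

A high rate of false positives, as reported for the Government of Ontario experiment, lowers precision and therefore the F-measure even when recall stays high.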
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Carter, D., Inkpen, D. (2012). Searching for Poor Quality Machine Translated Text: Learning the Difference between Human Writing and Machine Translations. In: Kosseim, L., Inkpen, D. (eds) Advances in Artificial Intelligence. Canadian AI 2012. Lecture Notes in Computer Science, vol 7310. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30353-1_5
DOI: https://doi.org/10.1007/978-3-642-30353-1_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30352-4
Online ISBN: 978-3-642-30353-1
eBook Packages: Computer Science, Computer Science (R0)