Investigating translated Chinese and its variants using machine learning

Hai Hu; Sandra Kübler

doi:10.1017/S1351324920000182

Published online by Cambridge University Press: 03 April 2020

Hai Hu*, Department of Linguistics, Indiana University, Bloomington, IN, USA
Sandra Kübler, Department of Linguistics, Indiana University, Bloomington, IN, USA
*Corresponding author. E-mail: huhai@indiana.edu

Abstract

Translations are generally assumed to share universal features that distinguish them from texts originally written in the same language. Thus, we can argue that these translations constitute their own variety of a language, often called translationese. However, translations are also influenced by their source languages and thus show different characteristics depending on the source language. Consequently, we argue that these variants constitute different “dialects” of translations into the same target language. Studies using machine learning techniques on Indo-European languages have investigated the universal characteristics of translationese and how translations from various source languages differ. However, for typologically very different languages such as Chinese, there are only a few corpus studies that tap into the intricate relation between translations and the originals, as well as into the relations among translations themselves. In this contribution, we investigate the following questions: (1) What are the characteristics of Chinese translationese, both in general and with respect to different source languages? (2) Can we find differences not only at the lexical level but also at the syntactic level? and (3) Given the characteristics found in answering the previous questions, which of the proposed laws and universals can we corroborate with our evidence from Chinese? We use machine learning to determine the importance of different characteristics and to compare their importance for our Chinese dataset with that of characteristics previously reported in studies on English. In addition, our methodology allows us to add syntactic features, which have rarely been used to study translations into Chinese. Our results show that Chinese translations as a whole can be reliably distinguished from non-translations, even based on only five features. More interestingly, typological traces from the source languages can often be found in their translations, thereby creating what we call dialects of translationese. For instance, translations from two Altaic languages exhibit more noun repetition and less frequent use of pronouns. Additionally, some characteristics that are not discriminative for English work well for Chinese, possibly because the distance between Chinese and the source languages is greater than the corresponding distance in studies on English.
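
As a rough illustration of the kind of surface characteristics mentioned above (noun repetition and pronoun use), the following minimal Python sketch computes two such measures from part-of-speech-tagged text. It is not the authors' feature extraction: the function names, the definition of "noun repetition" as the share of noun tokens that repeat an earlier noun, and the toy sentences are all assumptions made for illustration. The only external convention relied on is the Penn Chinese Treebank tag set, in which NN, NR and NT mark nouns and PN marks pronouns.

```python
from collections import Counter

# Penn Chinese Treebank tags: NN (common noun), NR (proper noun),
# NT (temporal noun), PN (pronoun).
NOUN_TAGS = {"NN", "NR", "NT"}
PRONOUN_TAG = "PN"


def pronoun_ratio(tagged):
    """Share of all tokens that are tagged as pronouns (hypothetical feature)."""
    if not tagged:
        return 0.0
    pronouns = sum(1 for _, tag in tagged if tag == PRONOUN_TAG)
    return pronouns / len(tagged)


def noun_repetition_rate(tagged):
    """Share of noun tokens that repeat a noun already used in the text.

    This definition is an assumption for illustration, not the paper's.
    """
    nouns = [word for word, tag in tagged if tag in NOUN_TAGS]
    if not nouns:
        return 0.0
    counts = Counter(nouns)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(nouns)


# Toy inputs as (word, tag) pairs: one repeats the noun, one uses a pronoun.
repeated = [("老师", "NN"), ("喜欢", "VV"), ("老师", "NN"), ("的", "DEG"), ("书", "NN")]
pronominal = [("老师", "NN"), ("喜欢", "VV"), ("他", "PN"), ("的", "DEG"), ("书", "NN")]

for name, text in [("repeated", repeated), ("pronominal", pronominal)]:
    print(name, round(noun_repetition_rate(text), 3), round(pronoun_ratio(text), 3))
```

Values like these, computed per text, could serve as inputs to a classifier that separates translated from non-translated Chinese; which classifier and which features the study actually uses is described in the full article, not here.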

Keywords

Translationese; Text classification; Chinese; Translation universal

Type: Article
Information: Natural Language Engineering, Volume 27, Issue 3, May 2021, pp. 339–372

DOI: https://doi.org/10.1017/S1351324920000182
Copyright: © Cambridge University Press 2020
