Abstract
Terminology translation plays a crucial role in domain-specific machine translation (MT). Preserving domain knowledge from source to target is arguably the most important concern for clients in the translation industry, especially in critical domains such as medical, transportation, military, legal and aerospace. Despite its importance to the translation industry, the evaluation of terminology translation has been a less examined area in MT research. Term translation quality in MT is usually assessed by domain experts, whether in academia or industry. To the best of our knowledge, there is as yet no publicly available solution for automatically evaluating terminology translation in MT. In particular, manual intervention is often needed to evaluate terminology translation, which is by nature a time-consuming and highly expensive task. In fact, this is infeasible in an industrial setting, where customised MT systems often need to be updated for many reasons (e.g. availability of new training data or leading MT techniques). Hence, there is a genuine need for a faster and less expensive solution to this problem, one that could help end-users instantly identify term translation problems in MT. In this study, we propose an automatic evaluation metric, TermEval, for evaluating terminology translation in MT. To the best of our knowledge, there is also no gold-standard dataset available for measuring terminology translation quality in MT. In the absence of a gold-standard evaluation test set, we semi-automatically create a gold-standard dataset from an English–Hindi judicial domain parallel corpus.
We train state-of-the-art phrase-based SMT (PB-SMT) and neural MT (NMT) models in two translation directions, English-to-Hindi and Hindi-to-English, and use TermEval to evaluate their performance on terminology translation over the created gold-standard test set. To measure the correlation between TermEval scores and human judgements, the translation of each source term in the gold-standard test set is validated by a human evaluator. The high correlation between TermEval and human judgements demonstrates the effectiveness of the proposed terminology translation evaluation metric. We also carry out a comprehensive manual evaluation of terminology translation and present our observations.
Notes
Since Moses can supply word-to-word alignments from the phrase table (if any) along with its output (i.e. the translation), one can exploit this information to trace the target translation of a source term in the output. However, there are a few potential problems with the alignment information, e.g. there could be null or erroneous alignments. Note that, at the time of this work, the transformer models of MarianNMT could not supply word alignments (i.e. attention weights). Moreover, our intention is to make the proposed evaluation method as generic as possible so that it can be applied to the output of any MT system (e.g. an online commercial MT engine). This led us to abandon such a dependency.
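To illustrate the alignment-based tracing that this note describes (and that the proposed method deliberately avoids), the following is a minimal sketch. It assumes alignments are given as space-separated `i-j` pairs of zero-based source and target token indices; the function names `parse_alignment` and `trace_term` and the placeholder target tokens are hypothetical and are not taken from the paper or from Moses.

```python
from typing import List, Optional, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse 'i-j' pairs (assumed format: source index i, target index j)."""
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def trace_term(
    source_tokens: List[str],
    target_tokens: List[str],
    alignment: List[Tuple[int, int]],
    term_tokens: List[str],
) -> Optional[List[str]]:
    """Return the target tokens aligned to the first occurrence of the
    source term, or None if the term is absent or has no alignment."""
    n = len(term_tokens)
    # Locate the first occurrence of the term in the source sentence.
    start = next(
        (i for i in range(len(source_tokens) - n + 1)
         if source_tokens[i:i + n] == term_tokens),
        None,
    )
    if start is None:
        return None
    span = range(start, start + n)
    # Collect every target position linked to a token inside the term span.
    target_indices = sorted({j for i, j in alignment if i in span})
    if not target_indices:
        return None  # null alignment: the term cannot be traced
    return [target_tokens[j] for j in target_indices]


# Toy example with placeholder target tokens (not real MT output).
source = "the high court dismissed the appeal".split()
target = ["T0", "T1", "T2", "T3", "T4"]
links = parse_alignment("1-0 2-1 3-4 5-2")
print(trace_term(source, target, links, ["high", "court"]))
print(trace_term(source, target, links, ["the"]))
```

In this toy input the source token at index 0 has no alignment link, so the second call cannot trace the term. This is the null-alignment problem mentioned above; an erroneous link would instead return the wrong target tokens without any warning.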
Acknowledgments
The ADAPT Centre for Digital Content Technology is funded under the Science Foundation Ireland (SFI) Research Centres Programme (Grant No. 13/RC/2106) and is co-funded under the European Regional Development Fund.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Haque, R., Hasanuzzaman, M., Way, A. (2023). Evaluating Terminology Translation in MT. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13451. Springer, Cham. https://doi.org/10.1007/978-3-031-24337-0_35
DOI: https://doi.org/10.1007/978-3-031-24337-0_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24336-3
Online ISBN: 978-3-031-24337-0
eBook Packages: Computer Science, Computer Science (R0)