research-article

Machine Translation Testing via Syntactic Tree Pruning

Authors:

Qingyu Wang

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 5

Article No.: 125, Pages 1 - 39

https://doi.org/10.1145/3640329

Published: 04 June 2024

Abstract

Machine translation systems are now widely used in daily life. Erroneous translations, however, can have severe consequences, such as financial losses, so the accuracy and reliability of these systems need to improve. Testing machine translation systems is challenging because of the complexity and intractability of the underlying neural models. To tackle these challenges, we propose syntactic tree pruning (STP), a novel metamorphic testing approach for validating machine translation systems. Our key insight is that a pruned sentence should preserve the crucial semantics of the original sentence. Specifically, STP (1) proposes a core semantics-preserving pruning strategy based on basic sentence structures and dependency relations at the level of the syntactic tree representation, (2) generates source sentence pairs based on the metamorphic relation, and (3) uses a bag-of-words model to report suspicious issues whose translations break the consistency property. We evaluate STP on two state-of-the-art machine translation systems (Google Translate and Bing Microsoft Translator) with 1,200 source sentences as inputs. The results show that STP finds 5,073 unique erroneous translations in Google Translate and 5,100 unique erroneous translations in Bing Microsoft Translator (400% more than state-of-the-art techniques), with 64.5% and 65.4% precision, respectively. The reported erroneous translations vary in type, and more than 90% of them are not found by state-of-the-art techniques. There are 9,393 erroneous translations unique to STP, which is 711.9% more than state-of-the-art techniques. Moreover, STP is effective at detecting translation errors for the original sentences, with recall reaching 74.0%, improving on state-of-the-art techniques by 55.1% on average.
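The consistency check in step (3) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the distance definition (words in the pruned translation that are absent from the original translation), and the threshold of 0 are all assumptions made for illustration. The toy strings stand in for translations that have already been segmented into space-separated tokens.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Multiset of lowercased, whitespace-separated tokens (a simplifying assumption)."""
    return Counter(text.lower().split())


def extra_word_count(original_translation: str, pruned_translation: str) -> int:
    """Count tokens in the pruned translation that the original translation lacks.

    Counter subtraction keeps only positive counts, so tokens that the
    original translation already covers contribute nothing.
    """
    extra = bag_of_words(pruned_translation) - bag_of_words(original_translation)
    return sum(extra.values())


def is_suspicious(original_translation: str, pruned_translation: str,
                  threshold: int = 0) -> bool:
    """Flag a sentence pair when the bag-of-words distance exceeds the threshold.

    The threshold value is hypothetical, not taken from the paper.
    """
    return extra_word_count(original_translation, pruned_translation) > threshold


# Toy, pre-tokenized stand-ins for the two translations of a sentence pair.
original = "the committee approved the new budget on friday"
consistent_pruned = "the committee approved the budget"
inconsistent_pruned = "the committee rejected the budget"

print(is_suspicious(original, consistent_pruned))
print(is_suspicious(original, inconsistent_pruned))
```

The intuition behind this kind of check is that pruning only removes content, so the translation of the pruned sentence should introduce few or no words that are missing from the translation of the original sentence. A real pipeline would need a tokenizer suited to the target language and a tuned threshold.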

Index Terms

Software and its engineering
  Software creation and management
    Software verification and validation
      Software defect analysis
        Software testing and debugging

Published In

ACM Transactions on Software Engineering and Methodology Volume 33, Issue 5

June 2024

952 pages

EISSN: 1557-7392

DOI: 10.1145/3618079

Editor:
Mauro Pezzè
USI Università della Svizzera italiana and SIT Schaffhausen Institute of Technology, Switzerland

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 June 2024

Online AM: 10 January 2024

Accepted: 31 December 2023

Revised: 26 October 2023

Received: 31 August 2022
