research-article

Machine Translation Testing via Syntactic Tree Pruning

Published: 04 June 2024

Abstract

Machine translation systems have been widely adopted in daily life, making it easier and more convenient. Unfortunately, erroneous translations may have severe consequences, such as financial losses. This calls for improving the accuracy and reliability of machine translation systems. However, testing machine translation systems is challenging because of the complexity and intractability of the underlying neural models. To tackle these challenges, we propose a novel metamorphic testing approach based on syntactic tree pruning (STP) to validate machine translation systems. Our key insight is that a pruned sentence should preserve the crucial semantics of the original sentence. Specifically, STP (1) proposes a core semantics-preserving pruning strategy based on basic sentence structures and dependency relations at the level of the syntactic tree representation, (2) generates source sentence pairs based on the metamorphic relation, and (3) reports suspicious issues whose translations break the consistency property, as measured by a bag-of-words model. We further evaluate STP on two state-of-the-art machine translation systems (i.e., Google Translate and Bing Microsoft Translator) with 1,200 source sentences as inputs. The results show that STP accurately finds 5,073 unique erroneous translations in Google Translate and 5,100 unique erroneous translations in Bing Microsoft Translator (400% more than state-of-the-art techniques), with 64.5% and 65.4% precision, respectively. The reported erroneous translations vary in type, and more than 90% of them are not found by state-of-the-art techniques. There are 9,393 erroneous translations unique to STP, which is 711.9% more than state-of-the-art techniques. Moreover, STP is quite effective in detecting translation errors for the original sentences, with a recall reaching 74.0%, improving on state-of-the-art techniques by 55.1% on average.
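The consistency check in step (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the distance definition (words in the pruned translation that the original translation does not cover), and the threshold of 1 are all assumptions made for illustration, and the toy strings stand in for real translator output.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent a translation as a multiset of lowercased tokens.

    Whitespace tokenization is an assumption; a real target language
    may need a proper word segmenter.
    """
    return Counter(text.lower().split())


def uncovered_words(original_translation: str, pruned_translation: str) -> int:
    """Count tokens of the pruned translation missing from the original one.

    Assumed rationale: the pruned sentence keeps only part of the original
    sentence, so its translation should mostly reuse the original's words.
    """
    missing = bag_of_words(pruned_translation) - bag_of_words(original_translation)
    return sum(missing.values())


def is_suspicious(original_translation: str, pruned_translation: str,
                  threshold: int = 1) -> bool:
    """Flag the pair when the distance reaches a hypothetical threshold."""
    return uncovered_words(original_translation, pruned_translation) >= threshold


# Toy, hand-written "translations" used only to exercise the check.
original = "the committee approved the budget after a long debate"
consistent = "the committee approved the budget"
inconsistent = "the committee rejected the budget"

print(is_suspicious(original, consistent))    # False
print(is_suspicious(original, inconsistent))  # True
```

Using a multiset difference rather than a set difference means repeated words are counted, so a pruned translation that repeats a word more often than the original also contributes to the distance.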


Published In

ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 5
June 2024
952 pages
EISSN: 1557-7392
DOI: 10.1145/3618079
Editor: Mauro Pezzè

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 June 2024
Online AM: 10 January 2024
Accepted: 31 December 2023
Revised: 26 October 2023
Received: 31 August 2022
Published in TOSEM Volume 33, Issue 5

Author Tags

  1. Software testing
  2. machine translation
  3. metamorphic testing

Qualifiers

  • Research-article
