research-article

Testing Coreference Resolution Systems without Labeled Test Sets

Authors:
Jialun Cao
Hong Kong University of Science and Technology, Hong Kong, China / Guangzhou HKUST Fok Ying Tung Research Institute, Guangzhou, China

Yaojie Lu
Chinese Institute of Software at Chinese Academy of Sciences, Beijing, China

Ming Wen
Huazhong University of Science and Technology, Hubei, China

Shing-Chi Cheung
Hong Kong University of Science and Technology, Hong Kong, China / Guangzhou HKUST Fok Ying Tung Research Institute, Hong Kong, China

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, November 2023, Pages 107–119. https://doi.org/10.1145/3611643.3616258

Published: 30 November 2023

ABSTRACT

Coreference resolution (CR) is the task of resolving different expressions (e.g., named entities, pronouns) that refer to the same real-world entity or event. It is a core natural language processing (NLP) component that underlies and empowers major downstream NLP applications such as machine translation, chatbots, and question answering. Despite its broad impact, the problem of testing CR systems has rarely been studied. A major difficulty is the shortage of labeled datasets for testing. While it is possible to feed arbitrary sentences as test inputs to a CR system, a test oracle that captures their expected outputs (coreference relations) is hard to define automatically. To address this challenge, we propose Crest, an automated testing methodology for CR systems. Crest uses constituency and dependency relations to construct pairs of test inputs that are subject to the same coreference. These relations can be leveraged to define the metamorphic relation for metamorphic testing. We compare Crest with five state-of-the-art test generation baselines on two popular CR systems, applying each of them to generate tests from 1,000 sentences randomly sampled from CoNLL-2012, a popular dataset for coreference resolution. Experimental results show that Crest significantly outperforms the baselines. The issues reported by Crest are all true positives (i.e., 100% precision), compared with 63% to 75% precision achieved by the baselines.
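The idea of checking a CR system without labels can be illustrated with a minimal sketch. It is not Crest's implementation: `toy_resolver` is a hypothetical, deliberately naive stand-in for a CR system, the sentence pairs are written by hand rather than derived from constituency or dependency relations, and `violates_relation` assumes the simplest possible metamorphic relation (the two inputs should yield identical clusters).

```python
from typing import Callable, Dict, FrozenSet, Set

# A coreference result: a set of clusters, each a set of mention strings.
Clusters = Set[FrozenSet[str]]

PRONOUNS = {"he", "she", "it", "they", "him", "her"}


def toy_resolver(sentence: str) -> Clusters:
    """Hypothetical CR system: link each pronoun to the nearest
    preceding capitalized token. Naive on purpose, so it can fail."""
    tokens = sentence.replace(",", " ").replace(".", " ").split()
    clusters: Dict[str, Set[str]] = {}
    last_name = None
    for tok in tokens:
        if tok.lower() in PRONOUNS:
            if last_name is not None:
                clusters.setdefault(last_name, {last_name}).add(tok.lower())
        elif tok[0].isupper():
            last_name = tok
    return {frozenset(mentions) for mentions in clusters.values()}


def violates_relation(
    resolve: Callable[[str], Clusters], source: str, follow_up: str
) -> bool:
    """Assumed metamorphic relation: both inputs are expected to have
    the same coreference, so differing clusters signal a suspected bug.
    No labeled expected output is needed for either input."""
    return resolve(source) != resolve(follow_up)


if __name__ == "__main__":
    source = "Alice thanked Bob because he helped."
    # Hand-written follow-ups assumed to preserve the coreference he -> Bob.
    follow_ups = [
        "Alice thanked Bob yesterday because he helped.",
        "Alice thanked Bob, a friend of Carol, because he helped.",
    ]
    for follow_up in follow_ups:
        flagged = violates_relation(toy_resolver, source, follow_up)
        print(f"{follow_up!r}: suspected issue = {flagged}")
```

In this sketch the oracle is the consistency between the two outputs, which is why no ground-truth annotation is required; how the follow-up inputs are constructed is what determines whether a flagged inconsistency is a true positive.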

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
  2. Software organization and properties
    1. Software functional properties
      1. Correctness
        1. Consistency

Published in
ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2023
2215 pages
ISBN: 9798400703270
DOI: 10.1145/3611643
General Chair: Satish Chandra (Google, USA)
Program Chairs: Kelly Blincoe (University of Auckland, New Zealand), Paolo Tonella (USI Lugano, Switzerland)
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Available / v1.1
Author Tags
Coreference resolution testing
Metamorphic testing
SE4AI
Acceptance Rates
Overall acceptance rate: 112 of 543 submissions, 21%