DOI: 10.1145/3411763.3443423

Principles and Interactive Tools for Evaluating and Improving the Behavior of Natural Language Processing Models

Published: 08 May 2021

Abstract

While the accuracy of Natural Language Processing (NLP) models continues to improve, users expect more from models than accuracy alone can capture. Practitioners do attempt to inspect model blind spots and missing capabilities, but the status-quo processes tend to be ad hoc and biased. My thesis focuses on helping practitioners organize and explore their models’ inputs and outputs, so that they can gain more systematic insights into model behavior. I identified two building blocks that are essential for informative analysis: (1) scaling up the analysis by grouping similar instances, and (2) isolating important components by generating counterfactuals. To support multiple analysis stages (training data assessment, error analysis, and model testing), I designed a series of interactive tools that instantiate these two building blocks. In the process, I characterized the design space of grouping and counterfactual generation, seeking to balance machine power with practitioners’ domain expertise. My proposed future work explores how grouping and counterfactual techniques can benefit non-experts in the data collection process.
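To make the two building blocks concrete, below is a minimal sketch of how they might look on a toy sentiment classification task. This is an illustration only, not the thesis tools themselves: `model_predict`, the grouping attribute (presence of negation), and all example data are hypothetical stand-ins.

```python
from collections import defaultdict

def model_predict(text: str) -> str:
    """Hypothetical stand-in for any sentiment classifier."""
    return "neg" if "not" in text.lower() else "pos"

# Hypothetical labeled validation examples.
examples = [
    ("The movie was great.", "pos"),
    ("The plot was not convincing at all.", "neg"),
    ("I would not recommend it.", "neg"),
    ("Not bad at all!", "pos"),
]

# Building block 1: scale up analysis by grouping similar instances.
# Here the grouping attribute is simply whether the input contains negation.
groups = defaultdict(list)
for text, gold in examples:
    key = "negation" if "not" in text.lower() else "no negation"
    groups[key].append((text, gold))

for key, items in groups.items():
    errors = [t for t, g in items if model_predict(t) != g]
    print(f"{key}: {len(errors)}/{len(items)} errors")

# Building block 2: isolate important components with counterfactuals.
# Perturb a single phrase and check whether the prediction flips.
original = "The movie was great."
counterfactual = original.replace("great", "not great")
print(model_predict(original), "->", model_predict(counterfactual))
```

Even in this toy setting, grouping surfaces a systematic weakness (negated positives such as "Not bad at all!"), and the single-edit counterfactual isolates the negation as the component driving the prediction flip; the interactive tools described in the thesis aim to support this style of analysis at scale.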



Published In

CHI EA '21: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems
May 2021, 2965 pages
ISBN: 9781450380959
DOI: 10.1145/3411763
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Error Analysis
  2. Natural Language Processing
  3. Training Data Assessment

Qualifiers

  • Extended-abstract
  • Research
  • Refereed limited

Conference

CHI '21

Acceptance Rates

Overall Acceptance Rate 6,164 of 23,696 submissions, 26%


