DOI: 10.1145/3411763.3443423

Principles and Interactive Tools for Evaluating and Improving the Behavior of Natural Language Processing Models

Published: 08 May 2021

Abstract

While the accuracy of Natural Language Processing (NLP) models continues to improve, users expect more from models than accuracy alone can capture. Practitioners do attempt to inspect model blind spots and missing capabilities, but the status-quo processes tend to be ad hoc and biased. My thesis focuses on helping practitioners organize and explore their models’ inputs and outputs, so that they can gain more systematic insights into model behavior. I identified two building blocks that are essential for informative analysis: (1) scaling up the analysis by grouping similar instances, and (2) isolating important components by generating counterfactuals. To support multiple analysis stages (training data assessment, error analysis, and model testing), I designed a series of interactive tools that instantiate these two building blocks. In the process, I characterized the design space of grouping and counterfactual generation, seeking to balance machine power with practitioners’ domain expertise. My proposed future work explores how grouping and counterfactual techniques can benefit non-experts in the data collection process.
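To make the two building blocks concrete, below is a minimal sketch of how they might look on a toy sentiment classification task. This is an illustration only, not the thesis tools themselves: `model_predict`, the grouping attribute (presence of negation), and all example data are hypothetical stand-ins.

```python
from collections import defaultdict

def model_predict(text: str) -> str:
    """Hypothetical stand-in for any sentiment classifier."""
    return "neg" if "not" in text.lower() else "pos"

# Hypothetical labeled validation examples.
examples = [
    ("The movie was great.", "pos"),
    ("The plot was not convincing at all.", "neg"),
    ("I would not recommend it.", "neg"),
    ("Not bad at all!", "pos"),
]

# Building block 1: scale up analysis by grouping similar instances.
# Here the grouping attribute is simply whether the input contains negation.
groups = defaultdict(list)
for text, gold in examples:
    key = "negation" if "not" in text.lower() else "no negation"
    groups[key].append((text, gold))

for key, items in groups.items():
    errors = [t for t, g in items if model_predict(t) != g]
    print(f"{key}: {len(errors)}/{len(items)} errors")

# Building block 2: isolate important components with counterfactuals.
# Perturb a single phrase and check whether the prediction flips.
original = "The movie was great."
counterfactual = original.replace("great", "not great")
print(model_predict(original), "->", model_predict(counterfactual))
```

Even in this toy setting, grouping surfaces a systematic weakness (negated positives such as "Not bad at all!"), and the single-edit counterfactual isolates the negation as the component driving the prediction flip; the interactive tools described in the thesis aim to support this style of analysis at scale.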



Published In

CHI EA '21: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems
May 2021, 2965 pages
ISBN: 9781450380959
DOI: 10.1145/3411763
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Error Analysis
  2. Natural Language Processing
  3. Training Data Assessment

Qualifiers

  • Extended-abstract
  • Research
  • Refereed limited

Conference

CHI '21

Acceptance Rates

Overall Acceptance Rate 6,164 of 23,696 submissions, 26%


