Abstract
Accuracy metrics and explanations of outputs can provide users with useful information about the performance of machine learning-based systems. However, the availability of this information can also lead users to overlook potential problems in the system. This paper investigates whether making errors obvious to the user can influence trust in a system that has high accuracy but is flawed. To test this hypothesis, a series of experiments with different settings was conducted. Participants were shown examples of the predictions of text classification systems, the explanations of those predictions, and the overall accuracy of the systems. The participants were then asked to evaluate the systems based on those pieces of information and to indicate the reason for their evaluation decision. The results show that participants who were shown examples containing a pattern of errors in the explanation were less willing to recommend or choose a system, even when that system's accuracy metric was higher. In addition, fewer participants reported that the accuracy metric was the reason for their choice, and more participants mentioned the prediction explanation.
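For illustration only, the sketch below shows the three pieces of information the abstract describes: a text classifier's prediction, a word-level explanation of that prediction, and an overall accuracy metric. It is an assumption-laden stand-in, not the systems or explanation method used in the study; it uses scikit-learn and derives per-word contributions from a linear model's coefficients in place of a dedicated explanation technique.

```python
# Illustrative sketch (not the study's setup): a toy text classifier whose
# per-word contributions act as a prediction "explanation", reported together
# with an overall accuracy metric.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train_texts = ["great movie, loved it", "terrible plot and acting",
               "what a wonderful film", "boring and awful"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
clf = LogisticRegression().fit(X_train, train_labels)

test_texts = ["loved the wonderful acting", "awful, boring movie"]
test_labels = [1, 0]
X_test = vectorizer.transform(test_texts)
predictions = clf.predict(X_test)

# Overall accuracy metric, as would be shown to users.
print("accuracy:", accuracy_score(test_labels, predictions))

# Word-level explanation of one prediction: feature value * model coefficient.
doc = X_test[0]
names = vectorizer.get_feature_names_out()
contributions = {names[i]: doc[0, i] * clf.coef_[0, i]
                 for i in doc.nonzero()[1]}
print(sorted(contributions.items(), key=lambda kv: -abs(kv[1])))
```

The printed word contributions play the role of the prediction explanation shown to participants; a recurring mistake in such explanations is the kind of error pattern that, per the abstract, reduced participants' willingness to recommend a system despite a higher accuracy metric.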
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bracamonte, V., Hidano, S., Kiyomoto, S. (2023). Influence of Errors on the Evaluation of Text Classification Systems. In: de Sousa, A.A., et al. Computer Vision, Imaging and Computer Graphics Theory and Applications. VISIGRAPP 2022. Communications in Computer and Information Science, vol 1815. Springer, Cham. https://doi.org/10.1007/978-3-031-45725-8_8
DOI: https://doi.org/10.1007/978-3-031-45725-8_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45724-1
Online ISBN: 978-3-031-45725-8