Influence of Errors on the Evaluation of Text Classification Systems

  • Conference paper
  • In: Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022)

Abstract

Accuracy metrics and explanations of outputs can provide users with useful information about the performance of machine learning-based systems. However, the availability of this information can lead users to overlook potential problems in the system. This paper investigates whether making errors obvious to the user can influence trust towards a system that has high accuracy but is flawed. To test this hypothesis, a series of experiments with different settings was conducted. Participants were shown examples of the predictions of text classification systems, the explanations of those predictions, and the overall accuracy of the systems. The participants were then asked to evaluate the systems based on this information and to indicate the reason for their evaluation decision. The results show that participants who were shown examples containing a pattern of errors in the explanation were less willing to recommend or choose a system, even if that system's accuracy metric was higher. In addition, fewer participants reported the accuracy metric as the reason for their choice, and more participants mentioned the prediction explanation.



Author information

Corresponding author

Correspondence to Vanessa Bracamonte.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bracamonte, V., Hidano, S., Kiyomoto, S. (2023). Influence of Errors on the Evaluation of Text Classification Systems. In: de Sousa, A.A., et al. Computer Vision, Imaging and Computer Graphics Theory and Applications. VISIGRAPP 2022. Communications in Computer and Information Science, vol 1815. Springer, Cham. https://doi.org/10.1007/978-3-031-45725-8_8

  • DOI: https://doi.org/10.1007/978-3-031-45725-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45724-1

  • Online ISBN: 978-3-031-45725-8

  • eBook Packages: Computer Science, Computer Science (R0)
