ABSTRACT
This paper addresses an under-explored problem in AI-assisted decision-making: when objective performance information about the machine learning model underlying a decision aid is absent or scarce, how do people decide how much to rely on the model? Through three randomized experiments, we explore the heuristics people may use to adjust their reliance on machine learning models when performance feedback is limited. We find that when people receive no information about a model's performance, the level of agreement between their own judgments and the model's predictions on decision-making tasks where they have high confidence significantly affects their reliance on the model; this effect changes once aggregate-level model performance information becomes available. Furthermore, the influence of high-confidence human-model agreement on people's reliance is moderated by their confidence in the cases where they disagree with the model. We discuss the potential risks of these heuristics and provide design implications for promoting appropriate reliance on AI.
Index Terms
- Human Reliance on Machine Learning Models When Performance Feedback is Limited: Heuristics and Risks