ABSTRACT
Artificial Intelligence (AI) plays an increasingly integral role in determining our day-to-day experiences. Its applications are no longer limited to search and recommendation systems, such as web search and movie and product recommendations; AI is also being used in decisions and processes that are critical for individuals, businesses, and society. With AI-based solutions in high-stakes domains such as hiring, lending, criminal justice, healthcare, and education, the personal and professional implications of AI are far-reaching. Consequently, it becomes critical to ensure that these models make accurate predictions, are robust to shifts in the data, do not rely on spurious features, and do not unduly discriminate against minority groups. To this end, several approaches spanning areas such as explainability, fairness, and robustness have been proposed in recent literature, and many papers and tutorials on these topics have been presented at recent computer science conferences. However, there has been relatively little attention to the need for monitoring machine learning (ML) models once they are deployed, and to the associated research challenges.
In this tutorial, we first motivate the need for ML model monitoring [14], as part of a broader AI model governance [9] and responsible AI framework, from societal, legal, customer/end-user, and model developer perspectives, and provide a roadmap for thinking about model monitoring in practice. We then present findings and insights on model monitoring desiderata based on interviews with ML practitioners spanning domains such as financial services, healthcare, hiring, online retail, computational advertising, and conversational assistants [15]. Next, we describe the technical considerations and challenges associated with realizing these desiderata in practice, and provide an overview of techniques and tools for model monitoring (e.g., [1, 2, 5, 6, 8, 10-13, 18-21]). We then focus on the real-world application of model monitoring methods and tools [3, 4, 7, 11, 13, 16, 17], presenting practical challenges and guidelines for using such techniques effectively, as well as lessons learned from deploying model monitoring tools for several web-scale AI/ML applications. We present case studies across different companies, spanning application domains such as financial services, healthcare, hiring, conversational assistants, online retail, computational advertising, search and recommendation systems, and fraud detection. We hope that our tutorial will inform both researchers and practitioners, stimulate further research on model monitoring, and pave the way for building more reliable ML models and monitoring tools in the future.
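To give a flavor of the drift-detection techniques surveyed above (e.g., [5, 19, 21]), the following is a minimal sketch, not taken from the tutorial or any of the cited tools: it flags per-feature covariate drift by comparing live traffic against a baseline captured at training time, using a two-sample Kolmogorov-Smirnov test. The function name, feature names, and p-value threshold are illustrative assumptions.

```python
# Minimal covariate-drift check (illustrative sketch, not a cited tool's API).
import numpy as np
from scipy.stats import ks_2samp

def detect_feature_drift(baseline: np.ndarray,
                         live: np.ndarray,
                         feature_names: list[str],
                         p_threshold: float = 0.01) -> dict[str, float]:
    """Return p-values for features whose live distribution differs
    significantly from the training-time baseline."""
    drifted = {}
    for i, name in enumerate(feature_names):
        _, p_value = ks_2samp(baseline[:, i], live[:, i])
        if p_value < p_threshold:  # reject "same distribution" hypothesis
            drifted[name] = p_value
    return drifted

# Example on synthetic data: the second feature has a mean shift.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=(5000, 2))
live = np.column_stack([rng.normal(0.0, 1.0, 5000),
                        rng.normal(0.5, 1.0, 5000)])
print(detect_feature_drift(baseline, live, ["income", "age"]))
```

Production monitoring systems such as those described in [11, 13] go further, e.g., maintaining baselines with streaming quantile sketches [2, 8] and alerting pipelines, but the underlying principle of comparing live statistics against a training-time reference is the same.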
REFERENCES
[1] Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Whang, and Martin Zinkevich. 2019. Data Validation for Machine Learning. In MLSys.
[2] Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, and Pavel Vesely. 2021. Relative Error Streaming Quantiles. In PODS.
[3] Jordan Edwards et al. 2021. MLOps: Model management, deployment, lineage, and monitoring with Azure Machine Learning. https://tinyurl.com/57y8rrec
[4] Fiddler. 2022. Explainable Monitoring. https://www.fiddler.ai/ml-monitoring
[5] João Gama, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 46, 4 (2014), 1--37.
[6] Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, and Zachary Lipton. 2020. A Unified View of Label Shift Estimation. In NeurIPS.
[7] Michaela Hardt, Xiaoguang Chen, Xiaoyi Cheng, Michele Donini, Jason Gelman, Satish Gollaprolu, John He, Pedro Larroy, Xinyu Liu, Nick McCarthy, Ashish Rathi, Scott Rees, Ankit Siva, ErhYuan Tsai, Keerthan Vasist, Pinar Yilmaz, Muhammad Bilal Zafar, Sanjiv Das, Kevin Haas, Tyler Hill, and Krishnaram Kenthapadi. 2021. Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability in the Cloud. In KDD.
[8] Zohar Karnin, Kevin Lang, and Edo Liberty. 2016. Optimal quantile approximation in streams. In FOCS.
[9] Eren Kurshan, Hongda Shen, and Jiahao Chen. 2020. Towards self-regulating AI: Challenges and opportunities of AI model governance in financial services. In Proceedings of the First ACM International Conference on AI in Finance.
[10] Zachary Lipton, Yu-Xiang Wang, and Alexander Smola. 2018. Detecting and correcting for label shift with black box predictors. In ICML.
[11] David Nigenda, Zohar Karnin, Muhammad Bilal Zafar, Raghu Ramesha, Alan Tan, Michele Donini, and Krishnaram Kenthapadi. 2022. Amazon SageMaker Model Monitor: A System for Real-Time Insights into Deployed Machine Learning Models. In KDD.
[12] Sashank Reddi, Barnabas Poczos, and Alex Smola. 2015. Doubly robust covariate shift correction. In AAAI.
[13] Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. Automating large-scale data quality verification. In VLDB.
[14] David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden technical debt in machine learning systems. In NeurIPS.
[15] Murtuza N. Shergadwala, Himabindu Lakkaraju, and Krishnaram Kenthapadi. 2022. A Human-Centric Take on Model Monitoring. In ICML Workshop on Human-Machine Collaboration and Teaming (HMCaT).
[16] Ankur Taly, Kaz Sato, and Claudiu Gruia. 2021. Monitoring feature attributions: How Google saved one of the largest ML services in trouble. Google Cloud Blog. https://tinyurl.com/awt3f5ex
[17] TruEra. 2022. TruEra Monitoring. https://truera.com/monitoring/
[18] Alexey Tsymbal. 2004. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin 106, 2 (2004), 58.
[19] Geoffrey I. Webb, Roy Hyde, Hong Cao, Hai Long Nguyen, and Francois Petitjean. 2016. Characterizing concept drift. Data Mining and Knowledge Discovery 30, 4 (2016), 964--994.
[20] Yifan Wu, Ezra Winston, Divyansh Kaushik, and Zachary Lipton. 2019. Domain adaptation with asymmetrically-relaxed distribution alignment. In ICML.
[21] Indrė Žliobaitė, Mykola Pechenizkiy, and Joao Gama. 2016. An overview of concept drift applications. In Big data analysis: new algorithms for a new society. 91--114.