ABSTRACT
Quality assurance is required for the wide adoption of artificial intelligence (AI) systems in industry and society, including mission-critical areas such as the medical and disaster-management domains. However, methods for evaluating the quality of machine learning (ML) components, especially deep neural networks, have not yet been established. In addition, evaluators with different quality requirements and testing environments apply a variety of metrics across the lifecycle, from data collection through experimentation to deployment. In this paper, we propose a quality provenance model, AIQPROV, to record who evaluated quality, when, from which viewpoint, and how the evaluation result was used. The AIQPROV model focuses on human activities, so that it can be applied to the field of quality assurance, where human intervention is required. Moreover, we present an extension of the W3C PROV framework and build a database that stores provenance information across the quality assurance lifecycle, validating our model with 11 use cases.
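As a minimal illustration of the kind of record such a provenance model captures, the W3C PROV core triple of entity, activity, and agent can be sketched in plain Python. The class and field names below are illustrative stand-ins, not the paper's actual AIQPROV schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# W3C PROV core concepts: an Agent carries out an Activity,
# which uses and generates Entities (here, an ML model and a report).

@dataclass
class Agent:                 # prov:Agent -- who performed the evaluation
    name: str
    role: str

@dataclass
class Entity:                # prov:Entity -- an artifact used or generated
    identifier: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Activity:              # prov:Activity -- the evaluation itself
    name: str
    started_at: datetime                            # when
    associated_with: Agent                          # prov:wasAssociatedWith
    used: list = field(default_factory=list)        # prov:used
    generated: list = field(default_factory=list)   # prov:wasGeneratedBy

# A hypothetical quality-evaluation record: who, when, viewpoint, inputs, outputs.
evaluator = Agent(name="alice", role="QA engineer")
model = Entity("ex:model-v3", {"framework": "tensorflow"})
evaluation = Activity(
    name="robustness-evaluation",            # the quality viewpoint evaluated
    started_at=datetime.now(timezone.utc),
    associated_with=evaluator,
    used=[model],
)
report = Entity("ex:report-42", {"metric": "accuracy", "value": 0.93})
evaluation.generated.append(report)          # the evaluation produced a report

assert evaluation.associated_with.role == "QA engineer"
assert evaluation.generated[0].identifier == "ex:report-42"
```

A full PROV implementation would additionally serialize these records to a standard format such as PROV-O (RDF) or PROV-JSON so they can be queried across evaluations; the dataclasses here only show how the core relations line up with the who/when/viewpoint/usage questions the model answers.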