DOI: 10.1145/3626246.3654680
tutorial

Applications and Computation of the Shapley Value in Databases and Machine Learning

Published: 09 June 2024

Abstract

Recently, the Shapley value, a concept rooted in cooperative game theory, has found a growing number of applications in databases and machine learning. Because of its combinatorial nature, computing the Shapley value is #P-hard in general. To address this challenge, numerous studies develop efficient computation methods or explore alternative solutions in specific application contexts. Applications of the Shapley value in databases and machine learning, together with fast computation or approximation of the Shapley value in those applications, are becoming a new research frontier in the database community. This tutorial presents a comprehensive and systematic overview of Shapley value applications and computation in both the database and machine learning domains. We survey the existing methods from a perspective that differs from the current literature. Unlike most reviews, which mainly focus on applications, our approach focuses on the underlying algorithmic mechanisms and application-specific assumptions of these methods. This allows us to highlight the similarities and differences among the various Shapley value applications and computation techniques more effectively. Our tutorial categorizes these methods by their intrinsic processes, cutting across different applications. The tutorial begins with an introduction to the Shapley value and its diverse applications in databases and machine learning. It then examines the computational challenges of the Shapley value, presents state-of-the-art solutions for its efficient computation, and explores alternative solutions.
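
To make the computational challenge concrete, the following Python sketch contrasts exact computation, which enumerates every coalition and so takes time exponential in the number of players, with a simple permutation-sampling estimate. It is only an illustration and is not taken from the tutorial: the function names, the toy utility function, and the sample count are hypothetical choices, and plain Monte Carlo permutation sampling stands in for the more refined estimators the tutorial surveys.

```python
import itertools
import math
import random


def exact_shapley(players, utility):
    """Exact Shapley values by enumerating all coalitions (exponential time)."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a size-k coalition that excludes p: k! (n-k-1)! / n!
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for coalition in itertools.combinations(others, k):
                s = frozenset(coalition)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def sampled_shapley(players, utility, num_permutations=2000, seed=0):
    """Monte Carlo estimate: average marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    order = list(players)
    for _ in range(num_permutations):
        rng.shuffle(order)
        prefix = frozenset()
        prev = utility(prefix)
        for p in order:
            prefix = prefix | {p}
            cur = utility(prefix)
            totals[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: t / num_permutations for p, t in totals.items()}


def toy_utility(coalition):
    """Hypothetical game: worth 1 if the coalition has "a" and at least one of "b", "c"."""
    return 1.0 if "a" in coalition and bool({"b", "c"} & coalition) else 0.0


players = ["a", "b", "c"]
exact = exact_shapley(players, toy_utility)
approx = sampled_shapley(players, toy_utility)
```

In this toy game the exact values are 2/3 for "a" and 1/6 each for "b" and "c". The sampled estimate always sums to the value of the full coalition, because the marginal contributions within each ordering telescope, but the individual estimates carry sampling error that shrinks as the number of sampled orderings grows.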

      Published In

      SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
      June 2024
      694 pages
      ISBN:9798400704222
      DOI:10.1145/3626246
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Shapley value
      2. cooperative game theory
      3. data market
      4. databases
      5. machine learning

      Qualifiers

      • Tutorial

      Funding Sources

      • NSERC Discovery Grant
      • Beyond the Horizon Grant by Duke University

      Conference

      SIGMOD/PODS '24
      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%
