research-article

Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia

Published: 20 June 2023

Abstract

Organizations that would mutually benefit from pooling their data are nonetheless wary of sharing it. Sharing data is costly in time and effort, and the benefits of sharing are unclear. Without a clear cost-benefit analysis, participants default to not sharing. As a consequence, many opportunities to create valuable data-sharing consortia never materialize, and the value of the data remains locked.

We introduce a new sharing model, market protocol, and algorithms to incentivize the creation of data-sharing markets. The combined contributions of this paper, which we call DSC, incentivize the creation of data-sharing markets that unleash the value of data for their participants. The sharing model introduces two incentives: one that guarantees that participating is better than not doing so, and another that compensates participants according to how valuable their data is. Because operating a consortium is costly, we are also concerned with ensuring that its operation is sustainable: we design a protocol that ensures that a valuable data-sharing consortium forms when it is sustainable.

We introduce algorithms to elicit the value of data from the participants. This value is used first to cover the costs of operating the consortium and second to compensate participants for their data contributions. For the latter, we challenge the use of the Shapley value to allocate revenue. We offer analytical and empirical evidence for this position and introduce an alternative method that compensates participants better and leads to the formation of data-sharing consortia.
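For readers unfamiliar with the Shapley value mentioned above, the following is a minimal sketch of the textbook computation: each participant receives the average of its marginal contributions over all orders in which the consortium could be assembled. This illustrates only the baseline that the abstract questions. It is not the alternative method the paper proposes, which the abstract does not describe. The function name `shapley_values`, the participants A, B, and C, and the value table `TOY_VALUE` are all hypothetical and chosen only for illustration.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley value: average marginal contribution over all orderings.

    The cost grows factorially with the number of players, so this is only
    practical for very small consortia.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical value of each coalition's pooled data (assumption, not from
# the paper). A and B hold redundant data; C is only valuable when combined
# with A or B.
TOY_VALUE = {
    frozenset(): 0.0,
    frozenset({"A"}): 10.0,
    frozenset({"B"}): 10.0,
    frozenset({"C"}): 0.0,
    frozenset({"A", "B"}): 10.0,
    frozenset({"A", "C"}): 30.0,
    frozenset({"B", "C"}): 30.0,
    frozenset({"A", "B", "C"}): 30.0,
}

if __name__ == "__main__":
    phi = shapley_values(["A", "B", "C"], TOY_VALUE.__getitem__)
    for player, share in phi.items():
        print(f"{player}: {share:.2f}")
```

In this toy table the shares always sum to the value of the full consortium, and the two participants with interchangeable data receive equal shares. Whether such an allocation gives participants enough reason to join is the kind of question the abstract raises.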

      • Published in

        Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 2
        June 2023
        2310 pages
        EISSN: 2836-6573
        DOI: 10.1145/3605748

        Copyright © 2023 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

        Publisher

        Association for Computing Machinery

        New York, NY, United States
