research-article

Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia

Author:
Raul Castro Fernandez

The University of Chicago, Chicago, IL, USA

0000-0001-7675-6080

Proceedings of the ACM on Management of Data, Volume 1, Issue 2, Article No. 172, pp. 1–25. https://doi.org/10.1145/3589317

Published: 20 June 2023

Abstract

Organizations that would mutually benefit from pooling their data are nevertheless wary of sharing. This is because sharing data is costly, in both time and effort, while the benefits of sharing are unclear. Without a clear cost-benefit analysis, participants default to not sharing. As a consequence, many opportunities to create valuable data-sharing consortia never materialize, and the value of data remains locked.

We introduce a new sharing model, market protocol, and algorithms to incentivize the creation of data-sharing markets. The combined contributions of this paper, which we call DSC, incentivize the creation of data-sharing markets that unleash the value of data for their participants. The sharing model introduces two incentives: one that guarantees that participating is better than not participating, and another that compensates participants according to how valuable their data is. Because operating a consortium is costly, we are also concerned with ensuring that its operation is sustainable: we design a protocol that ensures that a valuable data-sharing consortium forms when it is sustainable.

We introduce algorithms to elicit the value of data from the participants. This value is used first to cover the costs of operating the consortium and second to compensate participants for their data contributions. For the latter, we challenge the use of the Shapley value to allocate revenue. We offer analytical and empirical evidence for this position and introduce an alternative method that compensates participants better and leads to the formation of data-sharing consortia.
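
For readers unfamiliar with the Shapley value mentioned above, the following is a minimal sketch of the textbook definition: each participant receives its marginal contribution averaged over all orderings in which the consortium could form. It illustrates only the baseline that the abstract questions, not the alternative method the paper proposes. The participant names, the `COALITION_VALUE` table, and its numbers are hypothetical and chosen purely for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to the worth of that coalition.
    The cost grows factorially with the number of players, so this is
    only practical for very small consortia.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical worth of each possible coalition (illustrative numbers only).
COALITION_VALUE = {
    frozenset(): 0.0,
    frozenset({"A"}): 10.0,
    frozenset({"B"}): 20.0,
    frozenset({"C"}): 30.0,
    frozenset({"A", "B"}): 40.0,
    frozenset({"A", "C"}): 50.0,
    frozenset({"B", "C"}): 60.0,
    frozenset({"A", "B", "C"}): 90.0,
}

allocation = shapley_values(["A", "B", "C"], COALITION_VALUE.__getitem__)
print(allocation)
```

With these assumed numbers the allocations sum to the worth of the full consortium (90), which is the efficiency property of the Shapley value.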

Index Terms

Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Data management systems

Published in
Proceedings of the ACM on Management of Data Volume 1, Issue 2
PACMMOD
June 2023
2310 pages
EISSN: 2836-6573
DOI: 10.1145/3605748
Editor:
Divyakant Agrawal
UC Santa Barbara, United States
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data markets
data sharing
incentives
machine learning sharing