Abstract
Organizations that would mutually benefit from pooling their data are nonetheless wary of sharing it. Sharing data is costly in both time and effort, while its benefits are unclear. Without a clear cost-benefit analysis, participants default to not sharing. As a consequence, many opportunities to create valuable data-sharing consortia never materialize, and the value of data remains locked.
We introduce a new sharing model, market protocol, and algorithms, which together we call DSC, to incentivize the creation of data-sharing markets that unleash the value of data for their participants. The sharing model introduces two incentives: one that guarantees that participating is better than not participating, and another that compensates participants according to how valuable their data is. Because operating a consortium is costly, we are also concerned with ensuring that its operation is sustainable: we design a protocol that ensures that a valuable data-sharing consortium forms when it is sustainable.
We introduce algorithms to elicit the value of data from the participants. This value is used first to cover the costs of operating the consortium and second to compensate participants for their data contributions. For the latter, we challenge the use of the Shapley value to allocate revenue. We offer analytical and empirical evidence for this position and introduce an alternative method that compensates participants better and leads to the formation of data-sharing consortia.
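For readers unfamiliar with the baseline being challenged, the following is a minimal sketch of the textbook Shapley value, which averages each participant's marginal contribution over all orders in which participants could join. It is not the alternative method proposed in the paper. The participant names, the `TOY_VALUE` table, and the function name `shapley_values` are hypothetical and chosen only for illustration.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value: average marginal contribution over all join orders.

    `value` maps a frozenset of players to the worth of that coalition.
    The cost grows factorially with the number of players, so this is
    only practical for very small examples.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


# Hypothetical coalition values (assumption, not from the paper):
# A and B hold redundant data; C's data is only useful combined with A or B.
TOY_VALUE = {
    frozenset(): 0.0,
    frozenset({"A"}): 10.0,
    frozenset({"B"}): 10.0,
    frozenset({"C"}): 0.0,
    frozenset({"A", "B"}): 10.0,
    frozenset({"A", "C"}): 30.0,
    frozenset({"B", "C"}): 30.0,
    frozenset({"A", "B", "C"}): 30.0,
}

shares = shapley_values(["A", "B", "C"], TOY_VALUE.__getitem__)
for player, share in shares.items():
    print(f"{player}: {share:.2f}")
```

In this toy setting the shares add up to the value of the full coalition, and the two participants with redundant data receive equal shares. Whether such an allocation gives every participant enough reason to join is the kind of question the abstract raises.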
Index Terms
- Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia