1 Introduction

Over the last decade, rapid technological innovation has led many research consortia to adopt data-driven approaches and to collaborate on making intelligent decisions that improve their scientific research activities. Data sharing practices are necessary to maximize the knowledge gained from the research effort. They can also reduce duplicative trials and accelerate the discovery and generation of new research ideas. However, when and what data should be shared, with whom and by what means, and how credit should be awarded to the dataset owner are still matters of intense debate and research. This debate has been renewed by the emergence of privacy issues associated with users’ data collected by different parties, whose primary motive is to obtain enhanced models while enabling maximal research knowledge and scientific benefit. Data analytics methods can significantly improve the quality of services, but they depend on collecting, sharing and mining research data. Users contribute much of the data voluntarily; other data are obtained by the system from observation of user activities, or inferred through advanced analysis of volunteered or observed data [1].

In many important domains, such as medicine and healthcare, both personalized patient care and medical research can benefit from sharing research data from clinical trials [2]. Often, not all possible purposes for the use of those data are known in advance, so the data owner’s consent needs to be asked again, which can be obtrusive to a researcher or data owner who does not see what there is to gain. In addition, it becomes hard or even impossible for data owners to remember what consent they have given to which enterprise and to keep track of who accesses their data and for what purpose. What is required is a flexible mechanism for obtaining and renewing consent for research data usage and sharing, one that provides appropriate and meaningful incentives for data sharing and ensures transparency, so that researchers are aware of which of their datasets have been accessed, by whom, for what purpose and under what conditions.

Advances in technology have produced many computational platforms that aim to ensure privacy, data sharing across intelligent computing systems, and security against attackers [3]. However, these services are often criticized for their centralization: in most cases, they do not collect and share the diverse fragments of user data coming from the large number of autonomous and independent entities [4], and trust resides with the centralized service providers for all storage and management of data [5]. In the past few years, distributed ledgers and blockchain technology have emerged as a promising means to support immutable and trusted records in various use cases, including healthcare, agricultural research and tourism. In addition, many blockchain systems provide a technology called “smart contracts” that allows the conditions for access to or modification of each data entity to be verified automatically. Smart contracts can be deployed to encode the allowed purposes of research data use, the software applications that may access the data, time limitations, the price for access, etc.

This paper proposes a blockchain-based research data sharing framework that incentivizes dataset owners with digital tokens, proper acknowledgment or both, while giving them access to detailed information about all their data in an immutable and incorruptible database. The rest of the paper is organized as follows. Section 2 gives an overview of blockchains and smart contracts. A brief analysis of existing architectures and their limitations is given in Sect. 3. Section 4 presents the solution architecture and our implemented model for decentralized data sharing in a research domain, which ensures users’ privacy and user control over the data. Finally, Sect. 5 concludes with future directions.

2 Background

2.1 Blockchain

If critical assets were moving through a supply chain, a distributed ledger would let us see where those goods are and what is happening to them, with a trust mechanism behind the records that makes it very difficult for fraudulent parties to inject false goods into the supply chain [5]. This technology is also called blockchain. The idea was first stated in the original source code for the digital cash system Bitcoin [6], but its effect is proving to be far wider than virtual cryptocurrency alone. A blockchain provides a digital ledger of records organized in ‘blocks’, which are linked together by cryptographic validation. Each block aggregates a timestamped batch of transactions to be placed in the chain. Every block refers to the signature of the previous block in the chain, so the chain can be traced back to the very first (genesis) block.
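The linking of blocks described above can be illustrated with a minimal Python sketch. It is not taken from Bitcoin or any other implementation: a SHA-256 hash of the block contents stands in for the block "signature", the helper names (`block_hash`, `make_block`, `is_valid_chain`) and the sample transactions are hypothetical, and consensus and mining are left out entirely.

```python
import hashlib
import json
import time


def block_hash(block):
    """Return the SHA-256 hex digest of a block's canonical JSON encoding."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def make_block(transactions, prev_hash, timestamp=None):
    """Bundle a batch of transactions with a timestamp and the previous block's hash."""
    return {
        "timestamp": time.time() if timestamp is None else timestamp,
        "transactions": list(transactions),
        "prev_hash": prev_hash,
    }


def is_valid_chain(chain):
    """Check that every block refers to the hash of the block before it."""
    for prev, current in zip(chain, chain[1:]):
        if current["prev_hash"] != block_hash(prev):
            return False
    return True


# Build a three-block chain starting from a genesis block (illustrative data).
genesis = make_block(["genesis"], prev_hash="0" * 64, timestamp=0)
block1 = make_block(["node1 publishes dataset A"], block_hash(genesis), timestamp=1)
block2 = make_block(["node2 requests dataset A"], block_hash(block1), timestamp=2)
chain = [genesis, block1, block2]
```

Because each block stores the hash of its predecessor, changing any earlier block changes its hash and breaks the reference held by the next block, which is what makes tampering detectable.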

A blockchain can be private, public or hybrid. MultiChain [7] is well suited to private blockchains, providing the privacy and control required in a package that is easy to configure and deploy. MultiChain addresses the problems of mining, privacy, and openness through integrated management of user permissions [7]. Once a blockchain is private, problems relating to scale are more easily resolved, since the chain’s participants can control the maximum block size. In MultiChain, all privileges are granted and revoked using network transactions that contain special metadata. The miner of the first “genesis” block automatically receives all privileges, including administrator rights to manage the privileges of other users. Future versions of MultiChain could also introduce “super administrators” who can assign and revoke privileges on their own.
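The permission model described above can be sketched as a toy Python ledger in which privileges change only through recorded transactions. This is an illustration under stated assumptions, not the MultiChain API: the class `PermissionLedger`, its methods and the privilege names are hypothetical, and the real system's metadata encoding is not modeled.

```python
class PermissionLedger:
    """Toy ledger in which privileges change only through recorded transactions."""

    # Illustrative privilege names, assumed for this sketch.
    ALL_PRIVILEGES = {"connect", "send", "receive", "mine", "admin"}

    def __init__(self, genesis_miner):
        self.transactions = []  # append-only log of grant/revoke records
        # The miner of the genesis block starts with every privilege.
        self.privileges = {genesis_miner: set(self.ALL_PRIVILEGES)}

    def _record(self, action, issuer, target, privilege):
        """Validate the issuer and append the permission change to the log."""
        if "admin" not in self.privileges.get(issuer, set()):
            raise PermissionError(f"{issuer} is not an administrator")
        if privilege not in self.ALL_PRIVILEGES:
            raise ValueError(f"unknown privilege: {privilege}")
        self.transactions.append(
            {"action": action, "issuer": issuer, "target": target, "privilege": privilege}
        )

    def grant(self, issuer, target, privilege):
        self._record("grant", issuer, target, privilege)
        self.privileges.setdefault(target, set()).add(privilege)

    def revoke(self, issuer, target, privilege):
        self._record("revoke", issuer, target, privilege)
        self.privileges.get(target, set()).discard(privilege)

    def has(self, address, privilege):
        return privilege in self.privileges.get(address, set())
```

For example, `PermissionLedger("node1")` gives node1 administrator rights, after which `grant("node1", "node2", "connect")` succeeds, while the same call issued by node2 raises `PermissionError`.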

Similarly, Ethereum is an open-source platform for creating decentralized applications (dapps), in which users interact with online services in a distributed peer-to-peer manner on a censorship-proof foundation. Developers can create interfaces and business logic with any of the well-known programming languages and tools. Ethereum has its own virtual currency, Ether (ETH), which can be used to pay transaction fees and to provide a primary liquidity layer for exchanging digital assets. “Messages” in Ethereum can be created either by an external entity or by a contract, unlike a Bitcoin transaction, which can only be created externally [8]. Ethereum messages can explicitly contain data, and if the recipient of a message is a contract account, it has the option to return a response. In addition, Ethereum has the “transaction”, a signed data package that stores a message to be sent from an externally owned account. The state in Ethereum is made up of accounts, each with a 20-byte address, and of state transitions between them [8]. We use the Ethereum blockchain for a semi-financial application, comparable to on-blockchain escrow, which allows users to enter into contracts and manage them using their ETH in order to deal with non-monetary assets such as research data.
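As a rough illustration of the account and message model described above, the following Python sketch applies a value-carrying message to a dictionary of accounts keyed by 20-byte addresses. It is a simplification and not Ethereum's actual state transition function: signatures, gas, nonces and contract code execution are omitted, and the names `Account`, `Message` and `apply_message` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Account:
    balance: int = 0           # balance in integer currency units
    is_contract: bool = False  # True for contract accounts


@dataclass
class Message:
    sender: bytes
    recipient: bytes
    value: int
    data: bytes = b""          # optional payload carried by the message


def apply_message(state, msg):
    """Return a new state with msg.value moved from sender to recipient."""
    for address in (msg.sender, msg.recipient):
        if len(address) != 20:
            raise ValueError("addresses must be 20 bytes long")
    sender = state.get(msg.sender, Account())
    if msg.value < 0 or sender.balance < msg.value:
        raise ValueError("insufficient balance")
    new_state = dict(state)  # leave the previous state untouched
    new_state[msg.sender] = Account(sender.balance - msg.value, sender.is_contract)
    recipient = new_state.get(msg.recipient, Account())
    new_state[msg.recipient] = Account(recipient.balance + msg.value, recipient.is_contract)
    return new_state
```

Returning a new dictionary rather than mutating the old one mirrors the idea of a transition from one state to the next, so a rejected message leaves the earlier state intact.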

2.2 Smart Contracts

Smart contracts are instances of contracts deployed on a blockchain, for example Ethereum [8]. They consist of functions that can be called from outside the blockchain or by other smart contracts. Blockchain coupled with smart contract technology removes the reliance on a central node between the transacting parties. Since smart contracts are broadcast on the blockchain, all connected parties across the network have a copy of them. A smart contract is a piece of software that stores the rules negotiating the terms of the contract, automatically verifies the contract and executes the agreed terms [8]. Like a stored procedure in a traditional system, it executes an agreed process when triggered by an authorized or agreed event. All contract transactions are stored in chronological order for future access, along with the complete audit trail of events. If any party tries to change a contract or transaction on the blockchain, all other parties can detect and prevent it. If any party fails, the system continues to function with no loss of data or integrity. Logically, therefore, it creates a single large secure computer system without the risks, costs and trust issues of a centralized model. Contracts are expressed in Ethereum Virtual Machine (EVM) code, which consists of bytes, each representing an operation. The code can be written in the Solidity language; it can access the value, sender and data of the incoming message as well as block header data, and it can return a byte array as output.

3 Related Works

According to Bierer et al. [9], data sharing, the “use of research data by persons other than those who originally gathered the data”, is no longer a hypothetical or occasional occurrence. Most research on data sharing concerns design frameworks that focus on optimizing the technical properties of such systems. However, the technical performance of a data sharing system alone does not guarantee that the system is practical. Decentralized approaches to data sharing have become a research trend because they overcome the limitations of centralized architectures, whose predefined point of access creates a central point of failure.

Prominent examples of data sharing systems include online P2P file-sharing networks, data management systems, and collaborative repositories such as Wikidata [10]. Almost all of these systems implement different architectures, and they are evaluated on different non-functional requirements, such as efficiency, scalability, or reliability [11]. With regard to data sharing in the cloud, Liu et al. introduced an incentive mechanism into rational secret sharing schemes and proposed a fair data access control scheme for cloud storage [12]. In the scheme, the decryption key reconstruction activity is first formalized, and its security, fairness, and correctness are defined. The decryption key is then obfuscated by generating a large number of fake keys over the shared data. When rational users exchange shares, they adjust the order of their actions according to the agreed terms.

Ozzie et al. [13] provided an architecture that facilitates user-controlled access to user profile information. A user can selectively mask or expose portions of her profile to third parties. Advertisers and content providers can offer incentives or enticements, in response to which a user may expose larger portions of her profile. Persona [14], an online social network, allows users to define rules about whom they share personal information such as photographs with, and to browse highly sensitive data on web pages. It uses attribute-based encryption and public key cryptography to hide data and provide the flexibility needed, and it uses group-based access policies for decryption and authentication by groups and users. Persona can perform as well as existing online social networks while adding privacy features. Similarly, the Houdini framework enables the sharing of context-aware and privacy-conscious user data for global computing [15]. It provides a method for collecting data from various sources, focusing on how and when to share them. It is built on an infrastructure for managing policies that focuses on user preferences; the main concerns of this infrastructure are how well the underlying rules perform and how the preferences are provisioned.

In the medical field, the research community increasingly recognizes the importance of sharing patient-level data from clinical trials. The European Medicines Agency (EMA), a number of drug companies and one other trial funder have already implemented data sharing [2]. However, these initiatives have yet to offer appropriate and meaningful incentives to capitalize on the promise of data sharing, and they rely on a centralized system for data storage and management. Importantly for sharing research data, well-developed incentive mechanisms in online communities motivate users to engage willingly in knowledge sharing with others [16]. The next section presents our model and discusses it.

4 Solution Framework and Discussions

Figure 1 presents our general solution architecture, which introduces a new way of incentivizing users to share their research data. We use blockchains to share data among registered parties/enterprises in their private network and incorporate an automatic (smart) contract so that access-control policies are stored securely on the blockchain. A user registers with the system by providing basic profile information and a public wallet address, and then activates the smart contract, which automates the functionality that supports user-controlled privacy. The contract is intended (a) to give the user full transparency over who accesses their data, when and for what purpose, (b) to allow the user to specify the range of purposes of data sharing, the kinds of data that can be shared, and the classes of applications/companies that can access the data, and (c) to provide an incentive to the user for sharing their research data (in the form of payment for the use of the data by applications, as specified by the smart contract). This user-incentive model is run by the nodes of the public (Ethereum) blockchain network.
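The three goals (a) to (c) can be illustrated with a small Python sketch of a policy check that also records every request. It is a hypothetical off-chain model, not the contract used in this work: the names `SharingPolicy`, `AccessRequest` and `request_access`, the field names and the flat per-access price are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingPolicy:
    allowed_purposes: frozenset           # (b) purposes the owner permits
    allowed_data_kinds: frozenset         # (b) kinds of data that may be shared
    allowed_requester_classes: frozenset  # (b) classes of applications/companies
    price: int                            # (c) incentive owed per access, in token units


@dataclass(frozen=True)
class AccessRequest:
    requester: str
    requester_class: str
    purpose: str
    data_kind: str
    timestamp: int


def request_access(policy, request, audit_log):
    """Check a request against the owner's policy and log the outcome."""
    granted = (
        request.purpose in policy.allowed_purposes
        and request.data_kind in policy.allowed_data_kinds
        and request.requester_class in policy.allowed_requester_classes
    )
    # (a) every request, granted or not, is recorded for the data owner.
    audit_log.append(
        {
            "request": request,
            "granted": granted,
            "incentive_due": policy.price if granted else 0,
        }
    )
    return granted
```

In this sketch the audit log plays the role of the transparent record of who accessed the data, when and for what purpose; on a blockchain that record would be the chain of contract transactions itself.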

Fig. 1. General user-controlled privacy-preserving data sharing architecture.

Since the smart contract is stored on the public blockchain, users should keep their own digital token addresses safely stored in their personal wallets. Once a user’s data are used by any other participating party, that user is incentivized with digital tokens (ETH). For sharing data among enterprises, private (MultiChain) blockchains are installed on each participating registered node, which can publish items (research datasets) into a stream to be shared with other nodes in the network.

With reference to this general model, the actual implementation is shown in the solution framework in Fig. 2. The key elements of data sharing are to whom the data are made available and by what means, and how the researchers/data owners are incentivized, either with digital tokens or with acknowledgment of their efforts in collecting the data. Our system clearly informs registered users about what the smart contracts do with their data. With the smart contract on the public Ethereum blockchain, researchers retain ownership of their data and are incentivized as per the agreed terms. Any academic or industrial unit acting as a data seeker with valid credentials and approval from a local institutional review board (IRB) is eligible to access the data. The local IRB must also be enlisted in the system by providing certification that it is bound by regulations to review the scientific methods proposed by a node (data seeker) for accessing the research data. Through the smart contract, only the selected eligible nodes can access the items (datasets), by subscribing to the corresponding published streams. The data owner is incentivized according to the option negotiated between the two parties: an acknowledgment can be given to the data owner when a research article is published, and/or a predefined incentive is offered in the form of digital tokens by transferring ETH to the data owner’s Ether address. An escrow service can optionally be added to the system so as to bind the users with legal obligations. The access-control policies are stored securely on the blockchain while retaining the same user interface.
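The eligibility rule described above (a data seeker needs approval from an enlisted IRB before it can subscribe to a published stream) can be sketched in Python as follows. This is a toy in-memory model under stated assumptions and does not use the MultiChain API: `Stream`, `SharingNetwork` and their methods are hypothetical names, and credential checks are reduced to set membership.

```python
class Stream:
    """Toy stand-in for a published stream of dataset items."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner
        self.items = []
        self.subscribers = set()

    def publish(self, publisher, key, data):
        if publisher != self.owner:
            raise PermissionError("only the data owner may publish to this stream")
        self.items.append({"key": key, "data": data})


class SharingNetwork:
    """Tracks enlisted IRBs and the data seekers they have approved."""

    def __init__(self):
        self.registered_irbs = set()
        self.approvals = {}  # data seeker -> IRB that approved it

    def register_irb(self, irb):
        self.registered_irbs.add(irb)

    def approve(self, irb, seeker):
        if irb not in self.registered_irbs:
            raise PermissionError("IRB is not enlisted in the system")
        self.approvals[seeker] = irb

    def subscribe(self, seeker, stream):
        if seeker not in self.approvals:
            raise PermissionError("data seeker lacks IRB approval")
        stream.subscribers.add(seeker)

    def read(self, seeker, stream):
        if seeker not in stream.subscribers:
            raise PermissionError("data seeker is not subscribed to this stream")
        return list(stream.items)
```

A typical sequence would be: enlist an IRB, let it approve node2, have node1 publish a dataset item to its stream, then let node2 subscribe and read; an unapproved node3 is rejected at the subscribe step.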

Fig. 2. User-controlled privacy-preserving research data sharing model.

The smart contract, developed in Solidity, is deployed just once for each node on the Ethereum blockchain and stores _billingAddress. Its functions include startNewIncentive, pay, getStatus and finish, whose roles are described below.

To provide ETH to the data owner (say node1) in return for access to the data, a participating eligible data seeker at some node (say node2) queries the system for a specific filename. Public key cryptography is used to ensure the authenticity of the eligible users requesting the file. The query results in the execution of the startNewIncentive function of the smart contract with the incentiveID and the total incentive to be paid to the data owner. The incentiveID is generated for the data owner during registration. Node2 then invokes the pay function of the smart contract with the incentiveID of node1 and the ETH to be sent as an incentive to the data owner. The contract verifies the two parameters, receives the ETH and updates the status accordingly. The getStatus function is then called to obtain the status; once it confirms that node2 has provided the ETH, the data are made available to the data seeker, and finally the finish function is called to transfer the ETH to the _billingAddress. The ETH becomes available in node1’s account, since the incentiveID is paired with the ETH address of the data owner. Thus, the data seeker gains access to the data while the corresponding data owner is incentivized.

The MultiChain blockchain also has no scalability limit in terms of node count, as demonstrated in [17], because each node does not need to connect to every other node to form a fully connected peer-to-peer network. However, node catch-up time is a concern: new nodes joining the chain have to replay all transactions from the beginning, so it can take them significant time before they are up to date. The exact amount of time depends on how many blocks and transactions are in the chain.
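The incentive flow described above (startNewIncentive, pay, getStatus, finish) can be sketched as an in-memory Python model. It is not the Solidity contract from this work, whose listing is not reproduced here: the method names are Python-style counterparts of the named functions, and the status values, the exact-amount check and the simulated balances are assumptions made for illustration.

```python
class IncentiveContract:
    """Toy, in-memory stand-in for the incentive contract described in the text."""

    def __init__(self, billing_address):
        self._billing_address = billing_address  # data owner's ETH address
        self._incentives = {}                    # incentive_id -> record
        self.balances = {billing_address: 0}     # simulated ETH balances

    def start_new_incentive(self, incentive_id, total):
        """Register the total incentive owed for one data access."""
        if incentive_id in self._incentives:
            raise ValueError("incentive already exists")
        if total <= 0:
            raise ValueError("total incentive must be positive")
        self._incentives[incentive_id] = {"total": total, "held": 0, "status": "PENDING"}

    def pay(self, incentive_id, amount):
        """Verify both parameters, hold the payment and update the status."""
        record = self._incentives.get(incentive_id)
        if record is None or record["status"] != "PENDING":
            raise ValueError("unknown or already settled incentive")
        if amount != record["total"]:  # assumed rule: payment must match exactly
            raise ValueError("amount does not match the agreed incentive")
        record["held"] = amount
        record["status"] = "PAID"

    def get_status(self, incentive_id):
        return self._incentives[incentive_id]["status"]

    def finish(self, incentive_id):
        """Release the held payment to the data owner's billing address."""
        record = self._incentives[incentive_id]
        if record["status"] != "PAID":
            raise ValueError("incentive has not been paid")
        self.balances[self._billing_address] += record["held"]
        record["held"] = 0
        record["status"] = "FINISHED"
```

In this model the data would be released to the data seeker once `get_status` reports that payment has been received, and `finish` then credits the owner, which matches the order of calls given in the text.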

5 Conclusions

In summary, this paper presents a decentralized framework for incentivizing researchers to share their research data, which provides a way to specify and control the parameters of sharing and offers full accountability of access to such data. Security, scalability, and privacy are addressed through the use of smart contracts and blockchains, which together offer a secure distributed research data-sharing network. Our future work includes improving the current model by studying users’ attitudes to research data sharing with blockchain and the incentives they would find attractive for sharing their assets. We will also evaluate the usability and usefulness of the approach, and the trust users can have in the system.