1 Introduction

In today’s big data era, digital data are generated at an explosive rate, and people increasingly outsource their data to cloud servers to enjoy efficient data management services without bearing heavy local storage costs [1]. While cloud storage is helpful in many respects, public cloud services can put user information in danger [2]. Meanwhile, it is also important to keep track of what happens to the data throughout its lifecycle (from creation to ownership transfer to destruction or deletion), such as its ownership and custodial history and how it has been accessed by its users; this information is known as data provenance [3, 4]. For example, in a digital investigation, digital evidence must be strictly secured, and its ownership transfers and handling during its lifecycle must be clearly documented. Defendants commonly challenge the authenticity of digital evidence during a trial. The most common type of digital evidence is a hard disk image, and the defendant may question whether the hard disk image that investigators worked on and presented in the courtroom is the same one acquired from the hard disk found at the crime scene.

In the past decades, many security mechanisms have been developed to ensure the security and privacy of sensitive information and to achieve accountability and auditability through data access logging or audit trails [5], such as logging data creation, modification, and access. Much of the focus has been on protecting digitally stored information from unauthorized access or modification. However, despite extensive research on information security and privacy, little attention has been paid to securing provenance information and providing assurance that a data document is trustworthy. It is worth mentioning that, as current best practice, log files are also protected from tampering. In the banking industry, for example, activities such as bank transfers can only be recorded by creating a new log entry, and past logs cannot be modified or deleted for security reasons. Nevertheless, an important question remains: can the provenance information be trusted, so that the corresponding data document can itself be trusted after the series of user activities detailed in the provenance information?

Unlike traditional file access auditing, where file access activities are logged, provenance information contains the ownership history of data documents as well as the activities performed on them by their owners or users. Furthermore, such information is organized in chronological order over the lifecycles of the data documents, and allows accesses to and activities on data documents to be tracked. As a result, it not only improves accountability and reliability [6], but also meets the requirements of emerging applications, such as maintaining the digital chain of custody in a digital forensics investigation [7,8,9], as well as regulatory compliance requirements and industry standards, such as HIPAA [10].

Provenance information is not useful if it cannot be trusted, and it is inadvisable to trust provenance information without proper protection. For example, whenever a Microsoft Office document is created, Microsoft Office automatically embeds an author name into the document. This has proved very useful in solving crimes; a good example is the BTK killer case [11]. Nevertheless, the name assigned to the document can easily be changed. The problem is further exacerbated when data documents and the corresponding provenance records are outsourced to cloud storage that cannot be fully trusted, since the data documents and the provenance information are not physically held by the data owners and are transmitted over an insecure network [12]. Hence, it is crucial to ensure the security of provenance information.

Provenance information can be modeled as a sequence of records (each known as a provenance record) that present details about how a data document was processed at every stage of its lifecycle. To guarantee the security of provenance information, as with the protection of the data document itself, we can protect individual provenance records from unauthorized use or modification. However, integrating security mechanisms into current clouds to secure the individual records would incur additional costs for both the cloud provider and the users. As such, it is vital to identify what affects the practicality of secure provenance schemes and to define how that practicality should be evaluated.

In existing schemes [3, 13], an identity manager is introduced to efficiently secure the outsourced data provenance information, and any operation a user performs on the data must be authorized by the identity manager. By doing so, the provenance records reflect the state of the data throughout its lifecycle. Nevertheless, such a mechanism relies on the strong assumption that the identity manager is honest and reliable. Once the identity manager is compromised, the security of these schemes is broken: if the cloud server colludes with the identity manager, the outsourced provenance records can be modified without detection. In reality, compromising the identity manager is feasible, since an adversary (e.g., a malicious cloud server) can keep incentivizing the identity manager to deviate from the prescribed scheme over a long period of time. Furthermore, existing schemes do not consider the timeliness of provenance records.

In this paper, we propose an efficient and secure data provenance scheme for cloud storage systems called ESP. ESP is secure against provenance record forgery, removal, and modification attacks. The security of ESP holds even when the identity manager is compromised and the malicious cloud server colludes with it. The key technique behind ESP is blockchain-based currencies, which provide a secure way to conduct transactions without a central authority. In ESP, each provenance record is integrated into a transaction on the blockchain, and all provenance records corresponding to one data document form a record chain such that, if any one of them is corrupted, the chain is broken. Because a provenance record is integrated into a transaction on the blockchain, the record is time-stamped, and the time when it was generated can be extracted. As such, the provenance records in ESP not only keep track of what happened to the data, but also reflect when the data was processed. Detailed security analysis shows that ESP is secure against various attacks, even if the malicious cloud server/user colludes with the identity manager. Moreover, we introduce the concept of window of latching (WoL), which is one of the most important factors affecting the practicality of secure provenance schemes. We implement ESP and evaluate its performance. Experimental results show that the WoL of ESP is short and acceptable in practice, which demonstrates that ESP is efficient and practical. Specifically, the contributions of this work are as follows:

  • We formalize a model of data provenance, where the lifecycle of data documents is formally formulated. We also introduce the concept of WoL to measure the practicality of secure provenance schemes.

  • We propose an efficient and secure data provenance scheme (ESP) for cloud storage systems. ESP employs a provenance record chain built on blockchain-based currencies, e.g., Ethereum. This ensures the secure auditability of provenance records in terms of correctness, integrity, and timeliness, even if the identity manager is compromised.

  • We present a security analysis to demonstrate that ESP is secure and robust against various attacks. We implement ESP and conduct a comprehensive performance evaluation, which shows that ESP is highly efficient and practical.

2 Related Work

Data provenance provides sufficient information about what happened to the target data from creation to destruction. As we move into the age of big data, where digital data are generated at an explosive rate and most data are managed via the Internet with the aid of cloud systems, data provenance is very important to digital investigations [14, 15]. Once a dispute arises over outsourced data, provenance serves as the most vital evidence for post-incident investigation.

Lynch [16] first pointed out the need for trust and provenance in information retrieval. Hasan et al. [2] first defined the problem of secure provenance and argued that it is of vital importance in practice. An early scheme for secure data provenance in cloud storage systems was proposed by Lu et al. [3], where the basic security requirements were first enumerated, i.e., unforgeability and conditional privacy preservation. Unforgeability ensures that a provenance record reflects the corresponding state of the data, even if the data and the provenance record are outsourced to an untrusted environment; conditional privacy preservation guarantees that only an authenticated entity can reveal the real identity recorded in the provenance, while no one else can.

Following the work of Lu et al., several secure data provenance schemes have been proposed [13, 17]. These schemes mainly focus on enhancing the functionality of secure provenance for cloud storage systems. However, existing schemes introduce a trusted identity manager to secure the provenance records; if the identity manager is compromised, their security is broken. Moreover, existing schemes do not consider the lifecycles of data documents in cloud storage, do not explore the timeliness of provenance records, and do not investigate how to measure the practicality of secure provenance schemes. In this paper, we propose ESP, an efficient and secure data provenance scheme that ensures the correctness, integrity, and timeliness of provenance records against a malicious identity manager.

3 Preliminaries

3.1 Basic Cryptographic Primitives

Secure Hash Function. A secure hash function h has the following three properties: h can take a message of arbitrary length as input, and output a short fixed-size message digest; Given x, it is easy to compute \(h(x) = y\). However, given y, it is hard to calculate \(h^{-1}(y) = x\); Given x, it is computationally infeasible to find \(x'\ne x\) such that \(h(x') = h(x)\).

Bilinear Maps. Let G be an additive group and \(G_{T}\) be a multiplicative group, both of the same prime order p. A bilinear map e: \(G\times G\rightarrow G_{T}\) has the following properties. Bilinearity: \(e(aP, bQ) = e(P,Q)^{ab}\) for all \(P,Q\in G,\ a,b\in Z_{p}^{*}\); Non-degeneracy: there exist \(P,Q\in G\) such that \(e(P,Q)\ne 1\); Computability: e can be computed efficiently.
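To make the bilinearity property concrete, the following is a minimal toy sketch in Python. It is not a cryptographic pairing and is not how ESP instantiates e: as an assumption for illustration, G is taken to be the additive group \(Z_{p}\) and \(G_{T}\) the order-p subgroup of \(Z_{q}^{*}\), with the small, hypothetical parameters p = 1019, q = 2039, and g = 4. The map is bilinear and non-degenerate, but trivially insecure because discrete logarithms in this G are easy.

```python
# Toy bilinear map for illustration only; NOT cryptographically secure.
# Assumption: G is the additive group Z_p and G_T is the order-p subgroup
# of Z_q^* generated by g, with q = 2p + 1.
p = 1019  # prime order shared by G and G_T
q = 2039  # prime modulus with q = 2p + 1
g = 4     # 4 = 2^2 is a square mod q and not 1, so it has order p


def e(P: int, Q: int) -> int:
    """Toy pairing e(P, Q) = g^(P*Q mod p) mod q."""
    return pow(g, (P * Q) % p, q)


def check_bilinearity(P: int, Q: int, a: int, b: int) -> bool:
    """Check e(aP, bQ) == e(P, Q)^(ab) for the toy map."""
    lhs = e((a * P) % p, (b * Q) % p)
    rhs = pow(e(P, Q), a * b, q)
    return lhs == rhs
```

For example, `check_bilinearity(5, 7, 123, 456)` evaluates both sides of the bilinearity equation for P = 5, Q = 7, a = 123, b = 456. A real deployment would use a pairing-friendly elliptic curve from an established library instead.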

Fig. 1. A simplified Ethereum blockchain

3.2 Blockchain

We defer a brief introduction to the blockchain technique to Appendix A (Blockchain). We construct ESP on the Ethereum blockchain, since Ethereum is more expressive than other blockchain-based currencies. We show a simplified Ethereum blockchain in Fig. 1, where \(\mathsf {Tx}\) denotes a transaction, \(\mathsf {BlockHash}\) denotes the hash value of the current block, \(\mathsf {PrevBlockHash}\) denotes the hash value of the previous block, \(\mathsf {Time}\) denotes the time when the block is chained to the blockchain, and \(\mathsf {MerkleRoot}\) denotes the root value of a Merkle hash tree formed by all transactions recorded in the block. The value token of the Ethereum blockchain is called Ether.

In Ethereum, the state is made up of objects called “accounts”. There are two types of accounts: externally owned accounts and contract accounts. Externally owned accounts are controlled by private keys and can conduct transactions. Contract accounts are controlled by their contract code. When a transaction between two externally owned accounts is recorded into the blockchain, the balances of the two accounts are updated, and the user who conducts the transaction can set the “data” field to any binary data she/he chooses. Therefore, the Ethereum blockchain ensures the timeliness of the data state: when a payer transfers Ether to a payee, a string \(\varDelta \) can be set as the data value of the transaction; after the block containing this transaction is added to the blockchain, \(\varDelta \) is recorded, which means that \(\varDelta \) was generated no later than the time when the block was chained to the blockchain.

4 Models and Design Goals

4.1 A Model of Data Provenance

Lifecycle of a Data Document and Its Users. In this work, a data document lifecycle is viewed as a sequence of stages covering creation, modification, ownership transfer, and destruction, as shown in Fig. 2. After a data document has been created, it may go through many stages due to document modification or ownership transfer. Finally, a document may be destroyed or deleted, becoming unavailable to the users. Hence, an individual state of a data document can be uniquely identified by its content and owner, and can be represented as \(St_i=\mathcal {H}(F_i,O_i)\), where \(St_i\) stands for a state that the document has been in, \(F_i\) is the content of the document in state \(St_i\), \(O_i\) is the owner of the document in state \(St_i\), and \(\mathcal {H}\) stands for a secure hash function.
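As a minimal sketch of the state identifier \(St_i=\mathcal {H}(F_i,O_i)\), the following Python function hashes a document's content together with its owner. The function name `state_digest`, the choice of SHA3-256 for \(\mathcal {H}\), and the length-prefixed encoding of the two inputs are assumptions made for illustration; the model above does not fix an encoding.

```python
import hashlib


def state_digest(content: bytes, owner_id: str) -> str:
    """Return St_i = H(F_i, O_i) as a hex string, using SHA3-256 for H."""
    owner = owner_id.encode("utf-8")
    hasher = hashlib.sha3_256()
    # Length-prefix each field so that bytes cannot be shifted between the
    # content and the owner without changing the digest (assumed encoding).
    for field in (content, owner):
        hasher.update(len(field).to_bytes(8, "big"))
        hasher.update(field)
    return hasher.hexdigest()
```

Under this sketch, either editing the content or transferring ownership yields a new state identifier, which matches the intuition that both events move the document to a new stage.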

During the lifetime of a data document, users can play different roles in it, and can be classified into four types in general: creator, owner, editor, and viewer.

Creator: the user who created the data document.

Owner: the user who owns the data document. By taking ownership of a data document, the user can assign other users permissions on it, including editing, viewing, and transferring its ownership. By default, the creator of a data document is also its owner, but ownership can be transferred to another user by the current owner.

Editor: a user who is able to edit the data document.

Viewer: a user who can only view the data document.

A document can have many editors and viewers, but only one creator during its lifetime and one owner at a time. In addition to the four aforementioned types of users, we assume there exists an auditor who can verify the validity and trustworthiness of any provenance information, but without any knowledge of the identity of the user who generated each individual provenance record [2, 3].

Fig. 2. Document lifecycle

Fig. 3. Provenance model

Provenance Model. As shown in Fig. 3, in data provenance, provenance information is organized into a chain in chronological order, where each chain item is a provenance record that details how a data document was processed at one stage of its lifecycle. Each provenance record is associated with a specific document stage, and a legitimate user (e.g., an editor) may perform many actions on data documents. A typical provenance record consists of a specially formatted data block that contains information on how a data document was processed at a given time as well as its ownership information. This information can usually be classified into two types: Essential provenance data (EPD): information related to activities performed on the data document; Nonessential provenance data (NPD): security overhead generated by the security mechanisms used to protect provenance information.

Measuring the Practicality of Secure Provenance. With the provenance model, we introduce the concept of window of latching (WoL) to evaluate the practicality of secure provenance schemes.

Definition 1

Window of latching (WoL) is the time interval between two successive provenance records that are accepted and published. The shorter the WoL, the more practical the secure provenance scheme is.

4.2 Threat Model

In our threat model, we mainly consider the following security and privacy threats against data provenance.

Provenance Record Forgery Attack. A malicious user may collude with others to forge a valid provenance record in terms of the record’s content and its timeliness.

Provenance Record Removal Attack. A malicious user colludes with others to remove one or several existing provenance records that have been generated due to the operations performed on data documents.

Modification Attack. Similar to the two threats above, a malicious user, possibly colluding with others, may attempt to tamper with provenance information by modifying the provenance records.

Repudiation Attack. A malicious user may deny that he performed an action on a data document.

Privacy Violation. A privacy violation is an attack in which the identity of the user who generates a provenance record is leaked. Recall that in secure provenance schemes [3, 13], only conditional privacy preservation can be ensured, where the identity manager has the ability to reveal the real identity recorded in the provenance record, while no one else can.

4.3 System Model and Design Goals

As shown in Fig. 4, there are four different entities in ESP: users, an authenticated server, a cloud server, and an independent auditor. The authenticated server is used to authorize the users and control who can access the data. It also assists users in preserving their identity against adversaries. Data and the corresponding provenance information are generated by the users, and are stored in the cloud server. The auditor can check the provenance records’ validity including their correctness, integrity, and timeliness.

Different from existing schemes [3, 13], the authenticated server is not fully trusted by others, and thereby should be responsible for all its authorizations. As long as the authenticated server remains inaccessible to adversaries, we ensure both the security and privacy preservation. If both the authenticated server and cloud storage server are compromised, we retain the security guarantees on the provenance records in existing schemes.

Fig. 4. The system model of ESP

Fig. 5. Blockchain-based provenance record chain

Aiming at the above security challenges, our design goal is to develop an efficient and secure data provenance scheme in the cloud storage system. Specifically, the following goals should be achieved.

Security and Privacy Preservation. The validity of provenance records can be audited by authorized auditors with resistance against various attacks. The conditional privacy can be ensured.

Efficiency. The scheme should work efficiently without requiring much extra storage for its security mechanisms, for example, digital signatures and cryptographic hashes. Its WoL should be as short as possible so that it can be applied in practice.

5 The Proposed Scheme

5.1 Overview

ESP consists of three parts: system setup, secure provenance generation, and secure provenance verification.

In the first part, the authenticated server assigns a human-memorisable password to each user, and maintains a list that records the assigned passwords and the corresponding identities. With the list, the authenticated server can authenticate each user securely and efficiently.

When a user wants to process a document, she/he needs to be authorized by the authenticated server. The authenticated server then assists the user in generating a provenance record, which enables the user to prove to the cloud server that she/he is qualified to process the document. Each provenance record is integrated into a transaction on the blockchain, in which the user transfers a service charge to the authenticated server. This also time-stamps the provenance record.

To achieve security, all provenance records are chained together with the aid of the Ethereum blockchain. This is shown in Fig. 5, where the provenance record chain is indicated by dashed gray lines. Assume that there are currently n provenance records, \(P_{1}, P_{2}, \cdots , P_{n}\), each of which stands for a state of the underlying document at the corresponding stage of its lifecycle as modeled in Sect. 4.1. They are chained together as follows: from the second provenance record \(P_{2}\) onward, each record contains a data field that points to a block on the Ethereum blockchain, and this block relates to the previous provenance record. Each record is appended to the previous one until the chain reaches the last record of the current provenance information, \(P_{n}\). Finally, the last record is signed by the authenticated server, and the signature becomes the tail of the provenance record chain. In this case, if any existing record is modified or removed, the provenance record chain is broken. The computational cost of verifying provenance records consists mainly of hashing operations plus one signature verification for the last element, i.e., the tail of the provenance record chain. As a result, verification is very fast.
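The tamper-evidence argument above can be illustrated with a simplified sketch. Here a plain SHA3-256 hash chain stands in for the block pointers on Ethereum, and the final signature by the authenticated server is omitted; the function names `link`, `build_links`, and `first_broken_index` are hypothetical. The published links are assumed to be immutable, as data recorded on the blockchain would be.

```python
import hashlib
from typing import List, Optional


def link(record: bytes, prev_link: bytes) -> bytes:
    """Bind a record to the link of its predecessor (empty for the first)."""
    return hashlib.sha3_256(prev_link + record).digest()


def build_links(records: List[bytes]) -> List[bytes]:
    """Return one link per record; each link covers all earlier records."""
    links, prev = [], b""
    for record in records:
        prev = link(record, prev)
        links.append(prev)
    return links


def first_broken_index(records: List[bytes], links: List[bytes]) -> Optional[int]:
    """Return the index of the first record that no longer matches, else None."""
    prev = b""
    for i, (record, expected) in enumerate(zip(records, links)):
        if link(record, prev) != expected:
            return i
        prev = expected
    if len(records) != len(links):
        # A trailing record was removed or an unpublished one was appended.
        return min(len(records), len(links))
    return None
```

With links built once from the honest records, modifying or removing any record makes `first_broken_index` report the position where the chain no longer matches, using only hash evaluations.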

5.2 Description of ESP

A set of users \(\mathcal {U} = \{\mathcal {U}_{1}, \mathcal {U}_{2}, ...\}\), an authenticated server \(\mathcal {AS}\), a cloud storage server \(\mathcal {C}\), and a third-party auditor \(\mathcal {A}\) are involved in ESP.

System Setup:

  • With the security parameter \(\ell \), the system parameters \(\{p\), G, \(G_{T}\), P, e, \(E(\cdot )\), h, \(H\}\) are determined, where G is an additive group whose generator is P, \(e: G\times G\rightarrow G_{T}\), G and \(G_{T}\) have the same prime order p, \(E(\cdot )\) is a secure symmetric encryption algorithm, \(h: \{0,1\}^{*}\rightarrow Z_{p}^{*}\), and \(H: \{0,1\}^{*}\rightarrow G\).

  • \(\mathcal {AS}\) randomly chooses \(s\in Z_{p}^{*}\), and computes \(P_{pub} = sP\), and \(k = h(s)\).

  • \(\mathcal {AS}\)’s secret keys are (s, k), the corresponding public key is \(P_{pub}\).

  • For each \(\mathcal {U}_{i}\in \mathcal {U}\) with identifier \(ID_{i}\), \(\mathcal {AS}\) assigns a human-memorisable password \(pwd_{i}\) to her/him, and stores \((ID_{i}, pwd_{i})\) locally.

Secure Provenance Generation:

Once a user \(\mathcal {U}_{i}\) processes a document at \(\mathcal {C}\) and generates a provenance record \(P_{j}\), she/he will request \(\mathcal {AS}\) to generate secure provenance on the document process.

Phase 1: With the identifier \(ID_{i}\) and password \(pwd_{i}\), \(\mathcal {U}_{i}\) makes mutual authentication with \(\mathcal {AS}\) to establish a secure channel as follows.

  • \(\mathcal {U}_{i}\) randomly chooses \(r_{1}, a\in Z_{p}^{*}\), and fetches the current timestamp ct.

  • \(\mathcal {U}_{i}\) computes \(C_{1}, C_{2}\), where \(C_{1} = r_{1}P\), \(C_{2} = E_{k}(ID_{i}||pwd_{i}||aP||ct)\), and \(k = r_{1} P_{pub}\).

  • \(\mathcal {U}_{i}\) sends \((C_{1}, C_{2})\) to \(\mathcal {AS}\).

  • After receiving \((C_{1}, C_{2})\), \(\mathcal {AS}\) computes \(k = sC_{1} = sr_{1}P = r_{1}P_{pub}\), and extracts \(ID_{i}||pwd_{i}||aP||ct\) from \(C_{2}\) with k.

  • \(\mathcal {AS}\) checks the validity of the timestamp ct to resist the replay attack.

  • \(\mathcal {AS}\) authenticates \(\mathcal {U}_{i}\) by checking whether \((ID_{i}, pwd_{i})\) is stored locally.

  • \(\mathcal {AS}\) randomly chooses \(b\in Z_{p}^{*}\) and computes \(sk = b(aP)\) as the session key.

  • \(\mathcal {AS}\) calculates \(\mathcal {U}_{i}\)’s pseudonym \(PID_{j} = E_{k}(ID_{i}||ct||b)\), \(C_{3}\), and \(C_{4}\), where \(C_3 = bP\), \(C_4 = E_{sk}(ID_{i}||aP||bP||ct||PID_{j})\).

  • \(\mathcal {AS}\) sends \((C_{3}, C_{4})\) to \(\mathcal {U}_{i}\).

  • With \((C_{3}, C_{4})\), \(\mathcal {U}_{i}\) computes the session key \(sk = aC_{3} = abP\).

  • \(\mathcal {U}_{i}\) extracts \(ID_{i}||aP||bP||ct||PID_{j}\) from \(C_{4}\) with sk.

  • \(\mathcal {U}_{i}\) authenticates \(\mathcal {AS}\) and confirms the correctness of sk by verifying the correctness of \(ID_{i}||ct||aP||bP\).

  • Since the session key sk is shared between \(\mathcal {U}_{i}\) and \(\mathcal {AS}\), a secure channel between them is established for secure provenance.
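The two key derivations in Phase 1, \(k = sC_{1} = r_{1}P_{pub}\) and \(sk = b(aP) = aC_{3}\), can be checked with a toy sketch. As an assumption for illustration, scalar multiplication in G is replaced by modular exponentiation in a small multiplicative group (modulus 2039, subgroup order 1019, generator 4), and the encryption steps \(E_{k}\) and \(E_{sk}\), the password check, and the timestamp check are omitted. The names `rand_scalar`, `scalar_mult`, and `run_phase1_keys` are hypothetical, and the parameters are far too small to be secure.

```python
import secrets

# Toy parameters for illustration only (far too small to be secure).
Q_MOD = 2039  # prime modulus, Q_MOD = 2 * ORDER + 1
ORDER = 1019  # prime order of the subgroup generated by GEN
GEN = 4       # plays the role of the generator P


def rand_scalar() -> int:
    """Pick a random element of Z_p^* (here 1 .. ORDER - 1)."""
    return 1 + secrets.randbelow(ORDER - 1)


def scalar_mult(k: int, point: int) -> int:
    """Stand-in for kP: exponentiation in the multiplicative toy group."""
    return pow(point, k, Q_MOD)


def run_phase1_keys(s: int):
    """Derive k and sk on both sides; returns (user_k, as_k, user_sk, as_sk)."""
    p_pub = scalar_mult(s, GEN)           # AS public key P_pub = sP
    r1, a = rand_scalar(), rand_scalar()  # user's ephemeral secrets
    c1 = scalar_mult(r1, GEN)             # C_1 = r_1 P
    a_p = scalar_mult(a, GEN)             # aP, carried inside C_2
    user_k = scalar_mult(r1, p_pub)       # user: k = r_1 P_pub
    as_k = scalar_mult(s, c1)             # AS:   k = s C_1
    b = rand_scalar()                     # AS's ephemeral secret
    c3 = scalar_mult(b, GEN)              # C_3 = bP
    as_sk = scalar_mult(b, a_p)           # AS:   sk = b(aP)
    user_sk = scalar_mult(a, c3)          # user: sk = a C_3
    return user_k, as_k, user_sk, as_sk
```

Calling `run_phase1_keys(77)` with a hypothetical server secret s = 77 returns matching values for k on both sides and matching values for sk on both sides, which is what allows \(\mathcal {AS}\) to decrypt \(C_{2}\) and \(\mathcal {U}_{i}\) to decrypt \(C_{4}\).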

Different roles of \(\mathcal {U}_{i}\) require different execution between \(\mathcal {U}_{i}\) and \(\mathcal {AS}\).

Creator: \(\mathcal {U}_{i}\) creates a new document

If \(\mathcal {U}_{i}\) creates a new document F, i.e., the provenance record \(P_{j}\) is \(P_{1}\), where \(P_{1} = h(F_{1}||ID_{i})\) and \(F_{1}\) denotes the content of the document at the first stage, she/he requests a secure provenance from \(\mathcal {AS}\) as follows.

  • \(\mathcal {U}_{i}\) sends \(P_{1}\) to \(\mathcal {AS}\) via the secure channel.

  • \(\mathcal {AS}\) extracts \(PID_{1}\) from local storage (i.e., \(j=1\)).

  • \(\mathcal {AS}\) generates a signature on \(P_{1}\) and \(PID_{1}\) as \(\sigma _{T_{1}} = sH(P_{1}||PID_{1})\), and sends \(\sigma _{T_{1}}\) to \(\mathcal {U}_{i}\).

  • \(\mathcal {U}_{i}\) verifies \(e(\sigma _{T_{1}}, P)\ {\mathop {=}\limits ^{?}}\ e(H(P_{1}||PID_{1}), P_{pub})\). If the verification fails, reject.

  • \(\mathcal {U}_{i}\) creates a transaction \(Tx_{1}\) shown in Fig. 6, where \(\mathcal {U}_{i}\) transfers service charge to the \(\mathcal {AS}\)’s account, and the data field of the transaction is set to \(h(h(P_{1}||PID_{1})||\sigma _{T_{1}})\).

  • After \(Tx_{1}\) is recorded into the Ethereum blockchain, \((P_{1}||PID_{1}||Bl_{1}, \sigma _{T_{1}})\) is sent to \(\mathcal {C}\), and is published as the provenance record.
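The verification equation used by \(\mathcal {U}_{i}\) in the steps above follows from the bilinearity of e and \(P_{pub} = sP\). As a short worked derivation (spelled out here for the reader, using only the definitions in Sect. 3.1 and the system setup):

$$\begin{aligned} e(\sigma _{T_{1}}, P) = e(sH(P_{1}||PID_{1}), P) = e(H(P_{1}||PID_{1}), P)^{s} = e(H(P_{1}||PID_{1}), sP) = e(H(P_{1}||PID_{1}), P_{pub}). \end{aligned}$$

Hence a signature honestly generated with the secret key s always passes the check, while the check itself needs only the public values P and \(P_{pub}\).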

Fig. 6. The transaction conducted by the creator

Fig. 7. The transaction conducted by the editor/viewer

Editor/Viewer: \(\mathcal {U}_{i}\) edits/views an existing document

If \(\mathcal {U}_{i}\) edits/views an existing document, without loss of generality, we assume the underlying document is F and the provenance record \(P_{j} = h(F_{j}||ID_{i})\) with \(j \ge 2\), where \(F_{j}\) denotes the content of the document at the j-th stage. \(\mathcal {U}_{i}\) interacts with \(\mathcal {AS}\) as follows.

  • \(\mathcal {U}_{i}\) sends \((P_{j}, Bl_{j-1}, \sigma _{T_{j-1}})\) to \(\mathcal {AS}\) via the secure channel, where \(Bl_{j-1}\) denotes the hash value of the block that contains the transaction whose data field is \(h(h(P_{j-1}||PID_{j-1})||\sigma _{T_{j-1}})\).

  • \(\mathcal {AS}\) checks the validity of \(Bl_{j-1}\); if the check fails, it rejects.

  • \(\mathcal {AS}\) extracts \(PID_{j}\) from local storage.

  • \(\mathcal {AS}\) computes \(\varTheta (P_{j}) = H(P_{j}||PID_{j}||Bl_{j-1})\), generates a signature \(\sigma _{T_{j}} = s\cdot \varTheta (P_{j})\), and \(\mathcal {AS}\) sends \(\sigma _{T_{j}}\) to \(\mathcal {U}_{i}\).

  • \(\mathcal {U}_{i}\) verifies \(\sigma _{T_{j}}\) by checking whether \(e(\sigma _{T_{j}}, P){\mathop {=}\limits ^{?}} e(\varTheta (P_{j}),P_{pub})\).

  • \(\mathcal {U}_{i}\) creates a transaction \(Tx_{j}\) shown in Fig. 7, where \(\mathcal {U}_{i}\) transfers service charge to the \(\mathcal {AS}\)’s account, and the data field of the transaction is set to \(h(h(P_{j}||PID_{j})||\sigma _{T_{j}}||Bl_{j-1})\).

  • After \(Tx_{j}\) is recorded into the blockchain, \((P_{j}||PID_{j}||Bl_{j}, \sigma _{T_{j}}, Bl_{j-1})\) is sent to \(\mathcal {C}\), and is published as the provenance record.
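The value placed in the transaction's data field can be sketched as follows. This is a minimal illustration, assuming that h is SHA3-256 (as chosen in Sect. 7), that "||" is plain byte concatenation, and that \(P_{j}\), \(PID_{j}\), \(\sigma _{T_{j}}\), and \(Bl_{j-1}\) are already serialized as byte strings; the serialization and the function name `data_field` are hypothetical and not fixed by the scheme description.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA3-256, used here as the hash function h."""
    return hashlib.sha3_256(data).digest()


def data_field(p_j: bytes, pid_j: bytes, sigma_j: bytes,
               prev_block_hash: bytes = b"") -> bytes:
    """Compute h(h(P_j || PID_j) || sigma_j || Bl_{j-1}).

    For the creator's record (j = 1) there is no previous block, so
    prev_block_hash stays empty and the value is h(h(P_1 || PID_1) || sigma_1).
    """
    return h(h(p_j + pid_j) + sigma_j + prev_block_hash)
```

Because \(Bl_{j-1}\) enters the hash for every record after the first, changing which earlier block a record points to changes the 32-byte value that must appear on the blockchain.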

Finally, the secure provenance becomes

$$\begin{aligned} (P_{1}||PID_{1}||Bl_{1}, P_{2}||PID_{2}||Bl_{2}, \cdots , P_{j}||PID_{j}||Bl_{j}, \sigma _{T_{j}}). \end{aligned}$$

Secure Provenance Verification: Given the provenance

$$\begin{aligned} (P_{1}||PID_{1}||Bl_{1}, P_{2}||PID_{2}||Bl_{2}, \cdots , P_{j}||PID_{j}||Bl_{j}, \sigma _{T_{j}}), \end{aligned}$$

the auditor \(\mathcal {A}\) checks its correctness as follows:

  • Locate the last block \(Bl_{j}\) on the Ethereum blockchain, and verify the validity of the last recorded provenance record \(P_{j}||PID_{j}||Bl_{j}\).

  • Compute \(\varTheta (P_{j}) = H(P_{j}||PID_{j}||Bl_{j-1})\).

  • Check whether \(e(\sigma _{T_{j}}, P){\mathop {=}\limits ^{?}}e(\varTheta (P_{j}), P_{pub})\).

  • Extract the data information from blockchain according to \(Bl_{1}, ..., Bl_{j}\).

  • Verify the integrity and timeliness of the provenance by checking whether the hash value of each provenance record matches the extracted data. Here, the physical time when a provenance record was generated is derived from the height of the corresponding block. Specifically, let \(\tau _{j}\) denote the time when \(P_{j}||PID_{j}||Bl_{j}\) was generated and \(\rho _{j}\) the height of the corresponding block. Then \(\tau _{j} = \tau _{0} + \gamma \cdot \rho _{j}\) (seconds), where \(\tau _{0}\) is the physical time when the genesis block of Ethereum was generated (i.e., 2015-07-30, 03:26:13 PM UTC) and \(\gamma \) is the average time taken to mine a block in Ethereum.

If the provenance records pass all of the above checks, the provenance is accepted.
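The time derivation \(\tau _{j} = \tau _{0} + \gamma \cdot \rho _{j}\) used in the last verification step can be sketched as below. The genesis time is the one stated in the text; the default \(\gamma \) of 15 seconds is an assumption taken from the average block time reported in Sect. 7, and the function name `record_time` is hypothetical. The result is an estimate, since the real average block time varies.

```python
from datetime import datetime, timedelta, timezone

# Genesis time tau_0 as stated in the text (2015-07-30 15:26:13 UTC).
GENESIS_TIME = datetime(2015, 7, 30, 15, 26, 13, tzinfo=timezone.utc)


def record_time(block_height: int, avg_block_seconds: float = 15.0) -> datetime:
    """Estimate tau_j = tau_0 + gamma * rho_j from the block height rho_j."""
    if block_height < 0:
        raise ValueError("block height must be non-negative")
    return GENESIS_TIME + timedelta(seconds=avg_block_seconds * block_height)
```

For instance, under the assumed \(\gamma \) a record whose block has height 4 is estimated to have been generated 60 seconds after the genesis block.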

5.3 Remark

In principle, \(\mathcal {A}\) could check the time when a provenance record was generated by extracting the timestamp of the corresponding block from the blockchain. However, in Ethereum, the timestamp of a block cannot accurately reflect when the transactions included in the block were generated, since the block timestamp may be off by up to 900 seconds. To overcome this error, the auditor in ESP derives the transaction time from the height of the block that includes the transaction. The key observation is that the average time taken to mine a block is stable and can be measured, and the blockchain height can be trusted to grow steadily in both the short and the long term, which is formalized as the chain-growth property of blockchains [18]. By doing so, the derived time at which a provenance record was generated has an error of around 15 seconds in ESP, which significantly improves the accuracy of the timestamp.

6 Security Analysis

ESP is secure against provenance record forgery, removal, modification, and repudiation attacks, even if the authenticated server is compromised. ESP also guarantees conditional privacy preservation. We defer the detailed security analysis to Appendix B (Security Analysis of ESP).

7 Implementation and Evaluation

We implement ESP in Java and conduct the experiments on a laptop running Windows 7 with an Intel Core 2 i5 CPU and 8 GB of DDR3 RAM. The security level is set to 80 bits, and the hash function h is instantiated with SHA3-256. The implementation of ESP is illustrated in Fig. 8 and is described below. For clarity, we prefix calls with \(\mathcal {AS}\) when they are made by \(\mathcal {AS}\) and with \(\mathcal {U}\) when they are made by \(\mathcal {U}\).

Fig. 8. Implementation of ESP

\(\mathsf {Com\_SessionKey}\) is an interactive algorithm to compute the session key between \(\mathcal {U}\) and \(\mathcal {AS}\). \(\mathsf {Com\_Pseudonym}\) is an algorithm to compute a pseudonym for \(\mathcal {U}\). The signature on the provenance record \(P_{j}\) and the pseudonym \(PID_{j}\) is implemented by \(\mathsf {Sig(P_{j}||PID_{j})}\). The verification of provenance record \(T_{j-1}\) is implemented by \(\mathsf {Verify}\ \sigma _{T_{j-1}}\). Generating/editing the target file on \(\mathcal {U}\) is implemented by \(\mathsf {Generate.File}\).

Fig. 9. Communication overhead of creator and editor/viewer

Fig. 10. Communication overhead of authenticated server and cloud server

Fig. 11. Computational overhead

We show the communication overhead of the creator and editor/viewer in Fig. 9, where the size of human-memorisable password is set to 120 bits. We also show the communication overhead of the authenticated server and the cloud server in Fig. 10.

In ESP, generating the system parameters is a one-time computation, so we do not report the cost of initializing ESP. Instead, we show the computation delay on the users and the authenticated server in Fig. 11. In ESP, generating a provenance record takes less than 50 ms. ESP is built on the Ethereum blockchain. In Ethereum, a block and its transactions are considered confirmed once at least 12 consecutive blocks have been mined after it. The average time to mine a block is 15 seconds, so a transaction takes 15 seconds on average to be chained to the Ethereum blockchain and a further 12 × 15 = 180 seconds to be confirmed. As such, publishing a new provenance record takes 195 seconds (3.25 min) on average in ESP, and the time interval between two successive provenance records is around 3.25 min; another user may have to wait at least 3.25 min to work on the same document. This interval is the most important factor affecting the practicality of a secure provenance scheme; it is the window of latching (WoL) defined in Definition 1.

Another factor that affects the practicality of ESP is the cost of publishing a provenance record. The transaction fee in Ethereum can be set to values from 0.000021 Ether to 0.000756 Ether, and the average fee is 0.000378 Ether. As of May 2018, publishing a provenance record costs a user 25 US cents on average, which is acceptable to users relative to the value of the data that ESP protects.

The above experimental results demonstrate that ESP is efficient in terms of communication and computation overhead. We have also evaluated the WoL of ESP, and the results show that it is short and acceptable in practice. The above analysis also indicates that the WoL of ESP is mainly determined by the transaction confirmation time of the blockchain system, and that the cost of publishing provenance records depends on the transaction fees of the blockchain system.

8 Conclusion

In this paper, we have proposed an efficient and secure data provenance scheme (ESP) for cloud storage systems. ESP employs a blockchain-based provenance record chain to ensure the correctness, integrity, and timeliness of provenance records. ESP protects users’ real identities against the cloud storage server, which preserves users’ privacy. Detailed security analysis has shown that ESP is secure and robust against various attacks while preserving privacy. Compared with existing schemes, ESP can resist a malicious identity manager. We have introduced the concept of window of latching (WoL) to evaluate the practicality of secure provenance schemes. We have also implemented ESP and shown that its WoL is short and acceptable in practice, which demonstrates that ESP is practical and efficient.