1 Introduction

Nowadays, a large number of smart devices on the user side form the Internet of Things. The massive amount of data generated by the Internet of Things raises network problems (Gupta et al. 2018; Stergiou et al. 2018a, b), such as the storage, transmission and security of big data. As a result, cloud computing, with its powerful computing and storage capabilities, is applied to big data. However, since the cloud mainly provides remotely distributed resources far away from user terminal equipment, its main limitation is the delay between a user request and the cloud response. Due to limited network bandwidth, centralized cloud storage cannot process large amounts of data in time. Moreover, centralized cloud services face various serious security issues and challenges (Gou et al. 2017). To solve these problems, fog computing was proposed by Cisco in 2011. Fog computing, a kind of distributed storage, can improve network performance by placing data, computing and networking capabilities closer to the user. Compared with cloud computing, fog computing is located near mobile terminal users and, with its computing and storage capabilities, can provide data owners with more convenient data outsourcing services (Yaseen et al. 2018). Meanwhile, the cloud server can utilize its storage more efficiently by receiving and storing only unique data copies from the fog servers.

Deduplication means that the server stores only a single copy of duplicate data and provides each owner with an access link to it, so as to make efficient use of storage space. So far, many solutions have been proposed to realize deduplication in the cloud environment (Cui et al. 2017; Shin et al. 2017a; Fu et al. 2017; Zhang et al. 2018; Pooranian et al. 2018). However, most of the existing deduplication schemes cannot be extended directly to big data deduplication in fog computing because of their efficiency. For example, when a user submits a data item to the cloud, the cloud server needs to traverse all fog servers to find out whether a duplicate of the data exists; thus, the time complexity is \(O(m*n)\), where m is the number of fog servers and n is the number of data items in each fog server. As a result, the efficiency of deduplication is very low. Therefore, it is urgent to propose secure and efficient deduplication solutions for fog computing.

In this paper, we propose a secure and efficient big data deduplication scheme in fog computing. The main contributions of this paper are listed as follows:

  1. (1)

    We propose a new decentralized deduplication structure to improve the efficiency of searching for duplicate data and apply it to construct a secure and efficient deduplication scheme in fog computing. In our scheme, the cloud server can quickly determine which fog server needs to be traversed to search for duplicate data, instead of traversing all fog servers. This significantly improves the efficiency of big data deduplication in a multi-fog-server environment.

  2. (2)

    We design a proof of ownership protocol and embed it into the deduplication process. Our scheme can verify securely and efficiently whether a user possesses the ownership of the data: because the ownership proof generated in each round of the challenge is completely different, our scheme can resist replay attacks, forgery attacks, etc. Moreover, our scheme can achieve efficient big data deduplication with fewer public parameters.

  3. (3)

    Security proof and analysis show that our scheme is secure and reliable. Performance analysis and experimental results show that our scheme significantly reduces overheads compared with the existing schemes.

The rest of this paper is arranged as follows. Section 2 mainly introduces the related work. In Sect. 3, we define the main algorithms and techniques used in our scheme. Section 4 shows the system model and threat model of our scheme as well as the specific construction. Then, the security of our scheme is analyzed in Sect. 5, and the performance of our scheme is evaluated in Sect. 6. Finally, Sect. 7 gives a conclusion and a future outlook.

2 Related work

Douceur et al. (2002) first proposed the convergent encryption (CE) scheme to realize secure deduplication. In their scheme, the key that encrypts the data is the hash value of the data itself, which ensures that identical data produces identical ciphertext. As an important technique in secure deduplication research, convergent encryption has received extensive attention (Shin et al. 2017b; Kwon et al. 2017; Stanek and Kencl 2018; Yu et al. 2018). However, an illegitimate user who holds only the tag of a data item, rather than the complete data, can exploit the deduplication mechanism to launch an attack and gain access to the complete data stored on the server.

To guarantee data ownership, Halevi et al. (2011) first proposed the proof of ownership (PoW) model to verify that a user indeed possesses the data. A variety of schemes were then put forward to prove ownership of data using Merkle trees and encoding techniques. After that, Di Pietro and Sorniotti (2016) proposed an improved version of the PoW model based on a challenge-response strategy, making the proof of ownership more secure and reliable, but the server-side I/O overhead is relatively high. Xiong et al. (2017) introduced the new concept of proof of shared ownership (PoWS) and built a secure multi-server-assisted PoWS (ms-PoWS) scheme based on aggregate encryption, secret sharing and Bloom filters. However, secure deduplication occurs on the user side, which results in low deduplication efficiency. Mishra et al. (2018) proposed a block-level deduplication scheme that implements a merged proof of ownership and storage (MPoWS). This scheme enables users and servers to perform mutual authentication. With block-level deduplication, however, a relatively large amount of data needs to be uploaded to the server, and different block tags are used for the checking operations.

The above-mentioned schemes mainly address data deduplication in cloud storage and cannot be extended directly to deduplication in fog computing because of their efficiency. Koo et al. (2016) applied deduplication to fog computing for the first time and proposed a secure and efficient hybrid deduplication scheme for data outsourcing in fog computing. Yang et al. (2017) put forward an effective, privacy-preserving cross-domain big data secure deduplication scheme in the cloud environment, called the EPCDD scheme. By using a three-level cross-domain architecture, this scheme can support data management over a wide range. Moreover, based on a binary search tree, the search efficiency for duplicate data in this scheme is largely improved. However, in the process of cross-domain deduplication, the number of public system parameters is linear in the number of domains, resulting in excessive computational overhead. Subsequently, Koo and Hur (2018) proposed a secure deduplication scheme for encrypted data with dynamic ownership management, which saves storage cost and protects data privacy in fog computing. The scheme uses user-level key management and an update mechanism to achieve fine-grained access control. Moreover, regardless of how the number of outsourced data items changes, it enables the data owner to maintain a constant number of keys. However, the parameters of each data item stored on the fog servers are stored again in the cloud, which not only wastes cloud storage space but also seriously reduces the efficiency of duplicate checking. Recently, Ni et al. (2018) proposed a secure deduplication scheme for mobile devices based on fog computing. By designing a pseudo-random function, the fog node can detect and delete duplicate data without exposing the data content, and the scheme uses a chameleon hash function to achieve privacy protection for anonymous mobile users. However, the overheads on both the client and the server side of this scheme are quite large.

Although deduplication in fog computing has been closely studied, there are still few secure and efficient deduplication schemes for fog computing, and many problems remain in big data deduplication in this setting. Therefore, it is urgent to study how to achieve secure and efficient big data deduplication in fog computing.

3 Preliminaries

In this section, we introduce some major techniques and algorithms in the paper.

3.1 Decision tree

A decision tree (DT) (Jiang et al. 2017), shown in Fig. 1, is similar to a binary search tree. A DT is made up of nodes \(N_i\) (circles) and branches \(W_i\) (lines) and is generated dynamically during the search process. In a DT, the value of the root node is larger than the values of the nodes in its left subtree and smaller than the values of the nodes in its right subtree, and the left and right subtrees are themselves decision trees. The DT takes \(N_i\) as a dividing point, with the nodes on the left and right sides ordered according to their values. Starting from the root and ending at a matching node or a leaf node, a traversal result can be obtained. A DT supports three main operations: search, insertion and deletion.

Search: When the DT is not empty, the target value is compared with the value of the current node. If it is smaller, the search continues along the left branch; if larger, along the right branch. The search ends when an equal node \(N_i\) is found or when no equal node exists in the whole DT.

Insertion: When a new node \(N_i\) needs to be inserted into the DT, the search operation is executed first to find the insertion position. The new \(N_i\) can be inserted as an intermediate node or a leaf node.

Deletion: To delete a node \(N_i\), the DT is first searched for \(N_i\), which is then removed. After deletion, the structural properties of the DT are still maintained.
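For intuition only, these three operations behave like those of an ordered map keyed by the node values; a toy Java illustration (our own, not part of the scheme) is shown below.

```java
import java.util.TreeMap;

// Toy illustration: the search/insert/delete operations of a decision tree
// behave like those of an ordered map keyed by the node values.
public class DTOperations {
    public static void main(String[] args) {
        TreeMap<Long, String> dt = new TreeMap<>();
        dt.put(50L, "N1");                     // insertion
        dt.put(30L, "N2");
        dt.put(70L, "N3");
        boolean found = dt.containsKey(30L);   // search
        dt.remove(70L);                        // deletion keeps the ordering intact
        System.out.println("found=" + found + ", size=" + dt.size());
    }
}
```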

Fig. 1 Decision tree

3.2 PoW protocol

The Proof of Ownership (PoW) protocol (Halevi et al. 2011) is an effective way to improve the security of user data during server-side secure deduplication. The whole process is an interactive algorithm between the server and a user. When the algorithm starts, the server sends a challenge request to the user; the user computes the corresponding response from his own data and uploads it to the server as a proof. The server performs a verification operation after receiving the response. If the algorithm outputs true, the data ownership is proved; otherwise, the proof of ownership fails. The interaction of the PoW protocol is shown in Fig. 2.
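For concreteness, the roles in this interaction can be captured by the following interface sketch; the type and method names are our own and are only meant to mirror the challenge-response flow described above.

```java
// Illustrative interface for the PoW interaction (names are ours, not from the protocol).
public interface ProofOfOwnership {
    // Server side: issue a fresh challenge for a stored data item.
    byte[] challenge(String dataId);

    // User side: compute a response from the locally held data and the challenge.
    byte[] respond(byte[] data, byte[] challenge);

    // Server side: check the response; true means ownership is proved.
    boolean verify(String dataId, byte[] challenge, byte[] response);
}
```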

Fig. 2 Proof of ownership (PoW) protocol

Fig. 3 System model

4 Our scheme

4.1 System model

The system model of our scheme is mainly composed of four parts: user, fog server, cloud server and trusted key distribution center (KDC), as shown in Fig. 3.

User: Each user belongs to a fog, which is attached to a fog server. The user's smart devices generate a large amount of data. In order to save local storage, the user can outsource the big data to the cloud server through the fog. If the user is the first to upload a data item (the initial user), he needs to generate the tag, ciphertext and ownership proof of the data. If the user uploads a data item that already exists (a subsequent user), he needs to pass PoW verification to prove to the fog server that he really owns the entire data.

Fog Server: A fog server is a distributed entity managed by the cloud server. It provides duplicate-checking operations and PoW verification and transmits ciphertexts or messages to the cloud server.

Cloud Server: The cloud server has strong storage and computing power and provides services for the fog servers and users. When no duplicate is found on the user's own fog server, the cloud server coordinates duplicate checking across the fog servers. The cloud server stores the ciphertexts of the data.

KDC: The task of a trusted key distribution center is to assign and manage the public and private keys and public parameters of the system.

4.2 Threat model

Our scheme mainly considers the following two types of attackers.

  1. (1)

    Internal attackers: the fog servers or the cloud server. They usually execute data storage and deduplication honestly but are curious about sensitive information, and may illegally attempt to obtain the content of users' data, which violates users' data privacy.

  2. (2)

    External attackers: malicious users. They obtain, via the public channel, some information about data they are interested in and attempt to use this information to recover the content of the data.

4.3 Formal definition

The formal definition of our scheme contains six phases as follows:

  1. (1)

    Initialization phase:

    \(PPGen(1^\lambda ) \rightarrow pp\): Input the security parameter \(1^{\lambda }\), output the public parameter pp.

  2. (2)

    Data outsourcing phase:

    \(TagGen(m_{i}, pp) \rightarrow \tau _i \): Input the public parameter pp and data \(m_i\), output the data tag \( \tau _i \);

    \(KeyGen(m_{i}, pp)\rightarrow sk_i\): Input the public parameter pp and data \(m_i\), output the symmetric key \(sk_i\);

    \(Enc(m_i, sk_i) \rightarrow C_i\): Input the symmetric key \(sk_i\) and data \(m_i\), output the ciphertext \(C_i\);

    \(Proof(m_i, R_i, pp) \rightarrow \sigma _i\): Input the data \(m_i\), the public parameter pp and the parameter \(R_i\), output the ownership proof \(\sigma _i\).

  3. (3)

    Construction phase of decentralized deduplication structure:

    \(DDSGen(H(\tau _i), R_i, \sigma _i) \rightarrow \) DDS: Input the hash value \(H(\tau _i)\) of the data tag \( \tau _i \), the parameter \(R_i\), and the proof of ownership \(\sigma _i\), output the decentralized deduplication structure DDS.

  4. (4)

    Secure deduplication phase:

    \(Dedup(H(\tau _i )) \rightarrow \) (true, false): Input the hash value \(H(\tau _i)\) of the data tag \( \tau _i \). If the duplicated data of the data \(m_i\) exists in the fog server or the cloud server, then output true. Otherwise, output false.

  5. (5)

    PoW phase:

    \(ProofGen(\mathbf{chal })\rightarrow \sigma _i''\): Input the challenge \(\mathbf{chal }\), output the produced proof \(\sigma _i''\);

    \(CheckGen(\sigma _i'', \sigma _i)\rightarrow \) (true,false): Input the produced proof \(\sigma _i''\) and the ownership proof \(\sigma _i\). If \(\sigma _i = \sigma _i''\), the PoW verification succeeds and outputs true. Otherwise, it fails and outputs false.

  6. (6)

    Data retrieval phase:

    \(Dec(C_i, sk_i)\rightarrow m_i\): Input the ciphertext \(C_i\) and the symmetric key \(sk_i\), output data \(m_i\).
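For reference, the six phases above can be collected into a single interface sketch; the Java types (byte arrays for data, keys, tags and proofs) and method names are our own simplification and not part of the formal definition.

```java
// Illustrative summary of the six phases of the scheme (types and names are ours).
public interface DeduplicationScheme {
    byte[] ppGen(int securityParameter);                    // pp
    byte[] tagGen(byte[] data, byte[] pp);                  // tau_i
    byte[] keyGen(byte[] data, byte[] pp);                  // sk_i
    byte[] enc(byte[] data, byte[] sk);                     // C_i
    byte[] proof(byte[] data, byte[] r, byte[] pp);         // sigma_i
    void ddsGen(byte[] tagHash, byte[] r, byte[] sigma);    // build the DDS
    boolean dedup(byte[] tagHash);                          // duplicate found?
    byte[] proofGen(byte[] chal);                           // sigma_i''
    boolean checkGen(byte[] producedProof, byte[] sigma);   // PoW check
    byte[] dec(byte[] ciphertext, byte[] sk);               // m_i
}
```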

4.4 Construction

In our design, the initial user encrypts the data, outsources it to the fog server, and provides the data-specific parameters that allow subsequent users of the same data to regenerate the proof of ownership. After the initial user uploads the data, a subsequent user who attempts to upload the same data already stored on the cloud server must perform ownership verification with the fog server to prove that he indeed owns the data. Since data deduplication is performed on the fog server side, the computational overhead of the user is greatly reduced. The specific construction of our scheme is as follows.

4.4.1 Initialization phase

\(PPGen(1^\lambda ) \rightarrow pp\): KDC runs the RSA algorithm (Boneh et al. 2003) to generate m pairs of public and private keys \((e_1, d_1), (e_2, d_2), ..., (e_m, d_m)\) for the fog servers \({Fog}_{1}, {Fog}_{2}, ..., {Fog}_{m}\), respectively. Let G be a cyclic multiplicative group of large prime order p, and let g be a generator of G. KDC selects secure hash functions \(H(.), h(.): \{0,1\}^{*} \rightarrow \{0,1\}^{k}\). Finally, KDC assigns the public and private keys to the corresponding fog servers through a secure channel and publishes the system public parameter \(pp = {(G, g, h, H, e_1, e_2, ..., e_m)}\).
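A minimal sketch of the KDC setup, assuming 1024-bit RSA key pairs for the fog servers and SHA-256 instantiating both \(H(\cdot)\) and \(h(\cdot)\) (the group parameters \((G, g, p)\) are omitted here), might look as follows.

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Sketch of PPGen: the KDC generates one RSA key pair per fog server and publishes
// the public parts. Parameter choices (1024-bit RSA, SHA-256) follow the experimental
// setup described later in the paper; this is an illustration, not a reference implementation.
public class KDC {
    public static List<KeyPair> ppGen(int m) throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(1024);                       // (e_i, d_i) for Fog_i
        List<KeyPair> fogKeys = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            fogKeys.add(kpg.generateKeyPair());     // d_i is sent to Fog_i over a secure channel
        }
        return fogKeys;                             // the public keys e_1..e_m become part of pp
    }

    // Both H(.) and h(.) are instantiated with SHA-256 in this sketch.
    public static byte[] hash(byte[] input) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(input);
    }
}
```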

4.4.2 Data outsourcing

  1. (1)

    \(TagGen(m_{i}, pp) \rightarrow \tau _i \): The user computes \(h(m_i)\), generates the data tag \(\tau _i = g^{h(m_i)}\), and then encrypts \(\tau _i \) with the public key \(e_i\) of the fog server \(Fog_i\) using the RSA algorithm, i.e., \(T_{i} = Enc_{e_i}(\tau _i )\). Finally, the user sends \(T_i\) to the fog server \(Fog_i\).

  2. (2)

    \(KeyGen(m_{i}, pp)\rightarrow sk_i\): For the data \(m_i\), the user calculates the symmetric key \(sk_i=h(m_i || ID_{csp})\), where \(ID_{csp}\) is the cloud server ID returned by the fog server to the user.

  3. (3)

    \(Enc(m_i, sk_i) \rightarrow C_i\): The user encrypts the data \(m_i\) using the symmetric key \(sk_i\) and symmetric encryption algorithm, such as AES algorithm, and generates the ciphertext \(C_i=Enc_{sk_i}(m_i)\). It should be noted that after the initial user encrypts the data, subsequent users of the same data do not need to generate ciphertext again.

  4. (4)

    \(Proof(m_i, R_i, pp) \rightarrow \sigma _i\): If a duplicate of the data \(m_i\) exists on the cloud server, the fog server or the cloud server sends the parameter \(R_i\) corresponding to the data \(m_i\) to the subsequent user, and the subsequent user computes the ownership proof \(\sigma _i=h(m_i||R_i)\) using the parameter \(R_i\). Otherwise, the initial user chooses a random integer \(R_i\) and generates the ownership proof \(\sigma _i=h(m_i||R_i)\) corresponding to the data \(m_i\).
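As a complement to the four algorithms above, a minimal user-side sketch under simplifying assumptions is given below: g and p are placeholder group parameters, SHA-256 stands in for \(h(\cdot)\), and deterministic AES encryption is used only to keep the example short. It is an illustration rather than the authors' reference implementation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Sketch of the user-side outsourcing steps: TagGen, KeyGen, Enc and Proof.
public class DataOutsourcing {
    static byte[] sha256(byte[]... parts) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (byte[] part : parts) md.update(part);
        return md.digest();
    }

    // TagGen: tau_i = g^{h(m_i)} mod p (g, p are placeholder group parameters)
    static BigInteger tagGen(byte[] data, BigInteger g, BigInteger p) throws Exception {
        return g.modPow(new BigInteger(1, sha256(data)), p);
    }

    // KeyGen: sk_i = h(m_i || ID_csp)
    static byte[] keyGen(byte[] data, String idCsp) throws Exception {
        return sha256(data, idCsp.getBytes(StandardCharsets.UTF_8));
    }

    // Enc: C_i = Enc_{sk_i}(m_i); deterministic AES so identical data yields identical ciphertext
    static byte[] enc(byte[] data, byte[] sk) throws Exception {
        Cipher aes = Cipher.getInstance("AES/ECB/PKCS5Padding");
        aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(sk, "AES"));
        return aes.doFinal(data);
    }

    // Proof: sigma_i = h(m_i || R_i)
    static byte[] proof(byte[] data, byte[] r) throws Exception {
        return sha256(data, r);
    }
}
```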

4.4.3 Decentralized deduplication structure

In this section, we present a new decentralized deduplication structure, named DDS, to improve the efficiency of searching for duplicate data in the fog computing environment. DDS consists of an interval table (IT) and multiple deduplication decision trees (DDTs). The DDT is similar to that of the scheme (Jiang et al. 2017), but we optimize its search efficiency.

\(DDSGen(H(\tau _i), R_i, \sigma _i) \rightarrow \) DDS: Similar to the construction of a binary search tree, the DDT is initialized as an empty tree and is updated when data are inserted. Suppose there are p different data tri-tuples \(\{(H(\tau _1), R_1, \sigma _1), (H(\tau _2), R_2, \sigma _2), ..., (H(\tau _p), R_p, \sigma _p)\}\). Then the DDT is constructed from the p data tri-tuples, as shown in Fig. 4. The DDT satisfies the following properties: (1) if its left subtree is not empty, the hash values \(H(\tau _i)\) of all nodes in the left subtree are less than that of its root node; (2) if its right subtree is not empty, the hash values \(H(\tau _i)\) of all nodes in the right subtree are greater than that of its root node. For example, \(H(\tau _2)<H(\tau _1)\) and \(H(\tau _3)>H(\tau _1)\).

Fig. 4 Construction of DDT-1 in \(Fog_1\)

If a new tri-tuple \((H(\tau _j), R_j, \sigma _j)\) needs to be stored in the DDT, \(H(\tau _j)\) is first compared with the root node value \(H(\tau _1)\). If, for example, \(H(\tau _j)>H(\tau _1)\), the comparison continues in the right subtree of the root; if it then holds that \(H(\tau _j)<H(\tau _3)\), the tri-tuple \((H(\tau _j), R_j, \sigma _j)\) is inserted as the left child of \((H(\tau _3), R_3, \sigma _3)\). The algorithm for querying the DDT is shown as Algorithm 1. In the DDT, the time complexity of finding duplicate data is \(O(\log p)\).

Algorithm 1 Querying the DDT
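A possible realization of the DDT node structure and of the query in Algorithm 1, together with the extraction of the minimum and maximum tag hashes that later form a fog server's key item, is sketched below; the class and method names are ours, and tag hashes are compared as non-negative big integers.

```java
import java.math.BigInteger;

// Sketch of a deduplication decision tree (DDT) storing (H(tau), R, sigma) tri-tuples.
public class DDT {
    public static class Node {
        final BigInteger tagHash;     // H(tau_i), used as the ordering key
        final byte[] r;               // R_i
        final byte[] sigma;           // sigma_i
        Node left, right;
        Node(BigInteger tagHash, byte[] r, byte[] sigma) {
            this.tagHash = tagHash; this.r = r; this.sigma = sigma;
        }
    }

    private Node root;

    // Insert a tri-tuple, keeping the binary-search-tree ordering on H(tau).
    public void insert(BigInteger tagHash, byte[] r, byte[] sigma) {
        Node node = new Node(tagHash, r, sigma);
        if (root == null) { root = node; return; }
        Node cur = root;
        while (true) {
            int cmp = tagHash.compareTo(cur.tagHash);
            if (cmp == 0) return;   // this tag hash is already stored
            if (cmp < 0) { if (cur.left == null) { cur.left = node; return; } cur = cur.left; }
            else         { if (cur.right == null) { cur.right = node; return; } cur = cur.right; }
        }
    }

    // Query (Algorithm 1): return the stored tri-tuple for H(tau), or null if absent.
    public Node query(BigInteger tagHash) {
        Node cur = root;
        while (cur != null) {
            int cmp = tagHash.compareTo(cur.tagHash);
            if (cmp == 0) return cur;
            cur = (cmp < 0) ? cur.left : cur.right;
        }
        return null;
    }

    // Key item for the interval table: the smallest and largest H(tau) in this DDT.
    public BigInteger min() { Node c = root; while (c.left != null) c = c.left; return c.tagHash; }
    public BigInteger max() { Node c = root; while (c.right != null) c = c.right; return c.tagHash; }
}
```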

According to Algorithm 1, each fog server stores p data tri-tuples in turn at appropriate nodes and thus constructs a DDT. In addition, each fog server, such as \(Fog_i\), sets the minimum hash value \(H(\tau _x)^i_{min}\) and maximum hash value \(H(\tau _y)^i_{max}\) of data tags \(\tau _i\) in DDT as a data range in \(Fog_i\) and sends \((H(\tau _x)^i_{min}, H(\tau _y)^i_{max})\) to the cloud server.

The interval table (IT) is an index table. In IT, an index item is set up for each fog server; it contains a key item and a pointer item. Each key item consists of the minimum and maximum hash values of the data tags in the DDT stored on a fog server, and each pointer item is an address that links to that DDT. For example, supposing DDT-i is stored on the fog server \(Fog_i\), the corresponding key item is \((H(\tau _x)^i_{min}, H(\tau _y)^i_{max})\), and the pointer item is an address that links to DDT-i. Based on the key items \(\{(H(\tau _x)^1_{min}, H(\tau _y)^1_{max}), (H(\tau _x)^2_{min}, H(\tau _y)^2_{max}), ..., (H(\tau _x)^m_{min}, H(\tau _y)^m_{max})\}\) and the corresponding pointer items, IT is constructed and stored in the cloud server. Thus IT and the DDTs of all fog servers together form the new decentralized deduplication structure (DDS), shown in Fig. 5.

Fig. 5 Decentralized deduplication structure (DDS)

4.4.4 Secure deduplication

\(Dedup(H(\tau _i )) \rightarrow \) (true, false): If a user belonging to \(Fog_1\) wants to upload the data \(m_i\) to the cloud server, the user first calculates the data tag \(\tau _i =g^{h(m_i)}\) of \(m_i\) and then encrypts \(\tau _i\) with the public key \(e_1\) of the fog server \(Fog_1\) using the RSA algorithm, i.e., \(T_i = Enc_{e_1}(\tau _i)\). Finally, the user sends \(T_i\) to \(Fog_1\).

Upon receiving \(T_i\), \(Fog_1\) decrypts it with its private key \(d_1\) and obtains \(\tau _i\). Then \(Fog_1\) computes the hash value \(H(\tau _i)\) and searches for \(H(\tau _i)\) in DDT-1. If there is no match on \(Fog_1\), \(Fog_1\) sends \(H(\tau _i)\) to the cloud server. The cloud server searches IT for an index item satisfying \(H(\tau _x)^i_{min}<H(\tau _i)<H(\tau _y)^i_{max}\). If such an index item \((H(\tau _x)^i_{min}, H(\tau _y)^i_{max})\), corresponding to DDT-i stored on the fog server \(Fog_i\), is found, then \(H(\tau _i)\) is sent to \(Fog_i\) and matched against DDT-i. If the match is successful, another user has already stored the data \(m_i\) on the cloud server, that is, the algorithm \(Dedup(H(\tau _i))\) outputs true. Otherwise it outputs false.
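The two-level check can be sketched as follows, reusing the DDT class from the previous sketch and modeling the interval table IT as a list of per-fog key items; all names are our own, and the sketch abstracts away the network interaction between the fog servers and the cloud.

```java
import java.math.BigInteger;
import java.util.List;

// Sketch of the Dedup flow: local DDT check on the user's fog server first,
// then an interval-table lookup on the cloud, then a check on the matching fog's DDT.
public class Dedup {
    // One IT entry: the key item (min, max) plus a link to the fog server's DDT.
    public static class ItEntry {
        final BigInteger min, max;
        final DDT ddt;
        ItEntry(BigInteger min, BigInteger max, DDT ddt) { this.min = min; this.max = max; this.ddt = ddt; }
    }

    // Returns true if a duplicate of the tag hash H(tau_i) is found anywhere.
    public static boolean dedup(BigInteger tagHash, DDT localDdt, List<ItEntry> intervalTable) {
        if (localDdt.query(tagHash) != null) {
            return true;                                   // duplicate on the user's own fog server
        }
        for (ItEntry entry : intervalTable) {              // cloud side: find the covering data range
            if (tagHash.compareTo(entry.min) > 0 && tagHash.compareTo(entry.max) < 0) {
                return entry.ddt.query(tagHash) != null;   // forwarded check on Fog_i's DDT
            }
        }
        return false;                                      // no fog server covers this tag hash
    }
}
```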

4.4.5 PoW phase

\(ProofGen(\mathbf{chal })\rightarrow \sigma _i''\): If a duplicate of the data \(m_i\) exists in the cloud server, the user needs to complete ownership verification to prove to the fog server that he/she indeed owns the data. Firstly, the fog server randomly selects a number \(r_i \in Z_p^*\), then finds the \(R_i\) corresponding to \(m_i\), and finally sends the challenge \(\mathbf{chal } = (R_i , r_i)\) to the user. On receiving the challenge \(\mathbf{chal }\), the user computes \({k_i}' = h(\tau _i||r_i)\) and \(\sigma _i' = h(m_i||R_i)\), generates the produced proof \(\sigma _i'' = \sigma _i'\oplus {k_i}'\), and sends \(\sigma _i''\) to the fog server.

\(CheckGen(\sigma _i'', \sigma _i)\rightarrow \) (true,false): On receiving the produced proof \(\sigma _i''\), the fog server calculates \(k_i = h(\tau _i||r_i)\) and checks whether \(\sigma _i = \sigma _i''\oplus k_i\), where \(\sigma _i\) is the stored proof corresponding to \(m_i\). If the verification succeeds, the user has the ownership of the data \(m_i\) and does not need to upload \(m_i\) again. Otherwise, the user does not have the ownership of the data \(m_i\).

The specific operation of PoW is shown in Algorithm 2.

Algorithm 2 The PoW verification procedure
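A minimal sketch of the exchange in Algorithm 2, with SHA-256 standing in for \(h(\cdot)\) and byte-wise XOR for \(\oplus\) (function names are ours), is given below.

```java
import java.security.MessageDigest;
import java.security.SecureRandom;

// Sketch of the XOR-based PoW exchange between the fog server and the user.
public class PoW {
    static byte[] h(byte[]... parts) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        for (byte[] p : parts) md.update(p);
        return md.digest();
    }

    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }

    // Fog server: the challenge chal = (R_i, r_i) uses a fresh r_i in every round.
    static byte[] freshNonce() {
        byte[] r = new byte[20];                  // 160-bit r_i
        new SecureRandom().nextBytes(r);
        return r;
    }

    // User: sigma_i'' = h(m_i || R_i) XOR h(tau_i || r_i)
    static byte[] proofGen(byte[] data, byte[] R, byte[] tau, byte[] r) throws Exception {
        return xor(h(data, R), h(tau, r));
    }

    // Fog server: accept iff sigma_i == sigma_i'' XOR h(tau_i || r_i)
    static boolean checkGen(byte[] producedProof, byte[] storedSigma, byte[] tau, byte[] r) throws Exception {
        return MessageDigest.isEqual(storedSigma, xor(producedProof, h(tau, r)));
    }
}
```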

4.4.6 Data retrieval

\(Dec(C_i, sk_i)\rightarrow m_i\): The user sends a data retrieval request to the associated fog server, which forwards it to the cloud server. The cloud server returns the ciphertext \(C_i\) to the user through the fog server. After receiving the ciphertext \(C_i\), the user decrypts it with the decryption key \(sk_i\), thereby recovering the data \(m_i\).

5 Security analysis

5.1 Security proof

Theorem 1

Assuming that the hash function used in our scheme is a cryptographically secure hash function, and RSA public key encryption based on large integer factorization is secure, then our scheme can guarantee the proof of ownership of the original data.

Proof

If an attacker A can break our scheme with a non-negligible probability, then the attacker can, with non-negligible probability, successfully spoof the fog server without the original data and pass the verification process of the ownership proof. In the following, we show how this attacker's ability could be used to find, with non-negligible probability, a collision of the cryptographically secure hash function or a factorization of a large integer. This contradicts the premise of Theorem 1, which proves the security of our scheme. \(\square \)

Specifically, in order to spoof the fog server with non-negligible probability without the original data, pass the verification of the ownership proof, and recover the original data encryption key, the attacker A needs to generate or guess a value y such that \(h(y) = h(\tau _i||r_i)\). In this case, either \(y = \tau _i||r_i\) or \(y \ne \tau _i||r_i\). If \(y \ne \tau _i||r_i\), then y is a collision for the hash value \(h(\tau _i||r_i)\); if \(y = \tau _i||r_i\), it means that the attacker A, who does not possess the original data \(m_i\), has successfully forged \(\tau _i||r_i\). If the attacker A can forge \(\tau _i||r_i\) with non-negligible probability, he can mount a collision attack on the cryptographically secure hash function with non-negligible probability. However, this contradicts the assumption of a cryptographically secure hash function. Obviously, it is also infeasible for the attacker A to derive \(\tau _i\) from \(h(\tau _i||r_i)\), owing to the one-wayness of the cryptographically secure hash function. Moreover, encrypting \(\tau _i\) with the RSA public key ensures that \(\tau _i\) is transmitted securely over the channel, so it is impossible for a malicious user to use a tag to mount guessing attacks on the data. Therefore, our scheme implements a secure proof of ownership.

Table 1 Theoretical analysis and comparison of related schemes

5.2 Security analysis

We will conduct security attribute analysis on our scheme, including five aspects: privacy-preserving, availability, “original data participation in computing” security, “random tag” security, and “dynamic verification” security.

5.2.1 Privacy-preserving

Internal attacks are usually defined as attempts by the servers to obtain the plaintext. In our scheme, since the fog servers and the cloud server are semi-trusted, they perform data deduplication as specified but try to obtain the plaintext out of curiosity. Specifically, the fog server stores a tri-tuple \((H(\tau _i), R_i , \sigma _i)\) for each data item, and the cloud server stores an interval-table item \((H(\tau _x)^i_{min}, H(\tau _y)^i_{max})\) for each fog server together with all ciphertexts \(C_i\). Based on the hardness of the discrete logarithm problem and the cryptographically secure hash function, it is impossible for the fog server to obtain \(m_i\) from \(\tau _i = g^{h(m_i)}\) and \(\sigma _i = h(m_i||R_i)\).

Furthermore, \(C_i = Enc_{sk_i}(m_i)\) is encrypted with the symmetric key \(sk_i = h(m_i||ID_{csp})\). Without the original data \(m_i\), the fog server and the cloud server cannot compute the encryption key \(sk_i\) and thus cannot obtain the data \(m_i\). Even if the fog server colludes with the cloud server, they cannot recover the data \(m_i\) from \((H(\tau _i), R_i , \sigma _i)\) and \(C_i\).

Based on the above analysis, our scheme can resist internal attacks, external attacks and collusion attacks, and thus realizes privacy preservation.

5.2.2 Availability

Availability means that after the user encrypts the data and uploads it to the cloud server and deletes the local data, it must be guaranteed that the user can download and decrypt the ciphertext to get the original data.

Specifically, in our scheme, if the user wants to store the data \(m_i\) on the cloud server, the user first computes the symmetric key \(sk_i\), encrypts the data into the ciphertext \(C_i\) with \(sk_i\) to ensure data confidentiality, then generates the ownership proof \(\sigma _i\), and finally sends them to the fog server, while keeping the symmetric key \(sk_i\). When the cloud server or the fog server finds that the same data \(m_i\) was previously stored on the cloud server, the user deletes the local copy of \(m_i\). After that, if the user requests access to the data \(m_i\), the cloud server returns the ciphertext \(C_i\) to the user once the user's ownership of the data \(m_i\) has been successfully verified, and the user can decrypt \(C_i\) with the symmetric key \(sk_i\) he saved. Therefore, our scheme provides availability and ensures that the user can access the data.

Table 2 Comparison of computational overheads

5.2.3 “Original Data Participation in Computing” security

In the proof of ownership phase, a user is requested by the fog server to generate the ownership proof \(\sigma _i'' = \sigma _i'\oplus {k_i}'\), where \(k_i' = h(\tau _i||r_i)\) and \(\sigma _i' = h(m_i||R_i)\) is computed from the original data \(m_i\). If the user does not possess the data \(m_i\), he/she cannot generate the ownership proof \(\sigma _i''\). At the same time, the user cannot forge \(\sigma _i''\) in advance, since it involves a random number \(r_i\) selected by the fog server and the parameter \(R_i\) corresponding to the data \(m_i\). Therefore, the user can neither spoof the fog server by merely providing the data tag \(\tau _i\) of the original data \(m_i\), nor pass the challenge test of the ownership proof.

5.2.4 “Random Tag” security

In our scheme, because \(r_i\) is randomly chosen by the fog server each time, the ownership proof generated in each round of the challenge is completely different. Therefore, the ownership proof \(\sigma _i''\) serves as a random tag, and our scheme can resist replay attacks.

5.2.5 “Dynamic Verification” security

When the fog server receives \(\sigma _i''\), it calculates \(k_i = h(\tau _i||r_i)\) and then checks whether \(\sigma _i = \sigma _i''\oplus k_i\) holds. A forged \(\sigma _i''\) cannot pass the verification. Because \(\sigma _i''\) is a random tag, the ownership proof is verified dynamically.

6 Performance evaluation

In this section, we carry out theoretical and experimental analysis of our scheme and the related schemes (Koo and Hur 2018; Jiang et al. 2017) in terms of functionality, computational overheads, communication overheads and storage overheads. Suppose \(T_{exp}\), \(T_{hash}\), \(T_{pair}\), \(T_{RSA}\), \(T_{AES}\) and \(T_{XOR}\) denote the computational overheads of an exponential operation, a hash operation, a pairing operation, the RSA algorithm, the AES algorithm and an XOR operation, respectively.

6.1 Function

Firstly, we compare the functionality of our scheme with that of the analogous schemes (Koo and Hur 2018; Jiang et al. 2017), as shown in Table 1.

In Table 1, the schemes (Koo and Hur 2018) and (Jiang et al. 2017) support user-side deduplication, while our scheme supports server-side deduplication. In the Internet of Things, mobile users expect fast responses and services, so the computational burden on the client side should be reduced and the deduplication operation handed over to the server, which has more computing power. Both our scheme and the scheme (Koo and Hur 2018) achieve deduplication in a multi-fog environment, which improves deduplication efficiency and reduces delay. It is notable that the duplicate-query efficiency of our scheme is better than that of the scheme (Koo and Hur 2018). In our scheme, the efficiency at the fog server side is \(O(log\ n)\), where n is the number of data items, while that of the scheme (Koo and Hur 2018) is O(n). At the cloud side, the efficiency of our scheme is O(m), where m is the number of fog servers, whereas that of the scheme (Koo and Hur 2018) is \(O(m*n)\). All three schemes achieve secure encryption and deduplication, but only the scheme (Koo and Hur 2018) achieves access control.

6.2 Computational overheads

From Table 2, we can see that our scheme has lower computational overheads than the other schemes. This is because our scheme mainly involves exponential operations, modular exponentiation operations, XOR operations and hash operations, while the schemes (Koo and Hur 2018) and (Jiang et al. 2017) mainly involve bilinear pairing operations, exponential operations and hash operations.

In our scheme, the computational overhead of generating a data tag \(\tau _i\) is one exponential operation and one hash operation, encrypting the data tag with the RSA algorithm is one modular exponentiation operation, and generating the ownership proof \(\sigma _i\) is one hash operation. At the same time, an initial user needs to encrypt his/her data with the AES algorithm, and generating the encryption key \(sk_i\) takes one hash operation, so the computational overheads for an initial user are \((T_{exp}+3T_{hash}+T_{RSA}+T_{AES})\). If the duplicated data already exists on the cloud server, a subsequent user only needs to generate the data tag \(\tau _i\), encrypt \(\tau _i\), complete the proof of ownership with the fog server, and generate \(sk_i\). Therefore the computational overheads at the user side are \((2T_{exp}+7T_{hash}+2T_{RSA}+T_{XOR}+T_{AES})\), which is the total for the initial user and a subsequent user. At the fog server side, the decryption process requires one modular exponentiation operation to obtain \(\tau _i\); meanwhile the fog server computes \(H(\tau _i)\) to detect whether the same data is stored. If there is no duplicated data on the cloud server, the computational overheads of the fog server are \((T_{RSA}+T_{hash})\). Otherwise, the fog server additionally needs one hash and one XOR operation to perform the proof of ownership with the user, so the computational overheads are \((2T_{hash}+T_{RSA}+T_{XOR})\). The cloud server needs no computation.

In contrast, the scheme (Koo and Hur 2018) needs three pairing operations, six exponential operations, \((n+6)\) multiplication operations and one hash operation for an initial user, and three pairing operations, two exponential operations, \((n+4)\) multiplication operations and one hash operation for a subsequent user. It is worth noting that when a subsequent user invokes the PoW algorithm, the hash operations are related to the depth h of the MHT, and the corresponding computational overhead is recorded as \(T_{hash}*O(h)\). If the computational overhead of a multiplication operation is negligible compared with an exponential operation, then the computational overheads at the user side are \((6T_{pair}+8T_{exp}+2T_{hash}+T_{hash}*O(h))\). The fog server also invokes the PoW algorithm, so its computational overhead is \((T_{hash}*O(h))\). At the cloud server, the Update algorithm is invoked, so the computational overhead of the cloud server is \((4T_{exp})\). Therefore, the total computational overheads at the server side (including the fog server and the cloud server) are \((4T_{exp}+T_{hash}*O(h))\).

The scheme (Jiang et al. 2017) needs to generate two tags, a key and the ciphertext, which requires four exponential operations and one hash operation, so this process costs \((4T_{exp}+T_{hash})\). At the same time, the hash operations for generating the parameters \(S_{0b1b2...bi}\) and \(b_i\) are also related to the depth h of the deduplication decision tree, so the overhead of these two hash operations can be expressed as \(2T_{hash}*O(h)\). In this way, the computational overheads at the user side are \((4T_{exp}+T_{hash}+2T_{hash}*O(h))\) for the initial uploader and \((2T_{exp}+T_{hash}+2T_{hash}*O(h))\) for a subsequent uploader. On the server side, the cloud server needs to compare \(e(g^{r_*},g^{r_ih(m_i)})\) and \(e(g^{r_i},g^{r_*h(m_*)})\) to detect whether the same data is stored; this requires multiple bilinear pairing operations, the number of which is related to the depth h of the decision tree. Therefore, the computational overheads are \((2T_{pair}*O(h))\).

To investigate the actual computational overheads, we use the Java language to simulate our scheme and the schemes (Koo and Hur 2018; Jiang et al. 2017) in Eclipse using the JPBC library. All experiments were performed on an Intel Core i7 3.40 GHz processor with 4.00 GB of memory and a 64-bit Windows 7 operating system. Each result is the average of 200 runs. The experimental results are shown in Figs. 6 and 7.

As we analyzed, as the amount of data grows, the advantage of our scheme in computational overhead becomes more pronounced. This is because the scheme (Jiang et al. 2017) performs more bilinear pairing operations on the cloud server side, and the scheme (Koo and Hur 2018) performs a large number of bilinear pairing operations on the user side, which greatly increases the computational overhead. In summary, compared with the schemes (Koo and Hur 2018; Jiang et al. 2017), our scheme significantly reduces the computational overheads on both the user side and the server side.

Fig. 6 User-side computational overheads

Fig. 7 Server-side computational overheads

6.3 Communication overheads

In this part, we consider the communication overheads of the bitstream generated when the user interacts with the fog server or the cloud server.

In our scheme, an initial user needs to send the ciphertext \(T_i\) of the data tag \(\tau _i\), the ciphertext \(C_i\) of the data, the ownership proof \(\sigma _i\) and the random number \(R_i\) to the fog server, and to receive \(ID_{csp}\). However, the bitstream for transmitting \(ID_{csp}\) is particularly small and can be ignored, so the communication overhead is \(|T_i|+|C_i|+|\sigma _i|+|R_i|\). A subsequent user needs to send the ciphertext \(T_i\) of the data tag \(\tau _i\) to the fog server and, after receiving the challenge \(\mathbf{chal } = (R_i , r_i)\) from the fog server, only needs to send the produced proof \(\sigma _i''\) back, so the communication overhead of this process is \(|T_i|+|R_i|+|r_i|+|\sigma _i''|\). Therefore the total communication overheads are \(2|T_i|+|C_i|+|\sigma _i|+2|R_i|+|r_i|+|\sigma _i''|\). In the JPBC library, we choose a 160-bit \(r \in Z_p^*\), a 1024-bit generator, the SHA-256 hash function and AES symmetric encryption with a 256-bit ciphertext; the mainstream modulus of RSA encryption is 1024 bits, and the RSA ciphertext bit length is the same as the key bit length. Thus the communication overhead of the interaction over the channel is \(2|T_i|+|C_i|+|\sigma _i|+2|R_i|+|r_i|+|\sigma _i''| = 3296\) bits.
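One assignment of sizes consistent with the parameters above (\(|T_i| = 1024\) bits for the RSA ciphertext, \(|C_i| = |\sigma _i| = |\sigma _i''| = 256\) bits, and \(|R_i| = |r_i| = 160\) bits) reproduces this total:

\[
2|T_i|+|C_i|+|\sigma _i|+2|R_i|+|r_i|+|\sigma _i''| = 2\times 1024 + 256 + 256 + 2\times 160 + 160 + 256 = 3296\ \text{bits}.
\]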

In the scheme (Koo and Hur 2018), however, an initial user needs to send the data tag \(h_z(F)\) and the ciphertext \(C^{(1)}_F\) along with the data-specific public keys \((fpk^{(1)}_F, fpk'^{(1)}_F, upk^{(1)}_{i,F}, fuk^{(1)}_F)\) to the fog server, while a subsequent user needs to send the data tag \(h_z(F)\) and \(upk^{(j)}_{i,F}\) after receiving the data-specific public keys \((fpk^{(j)}_F, fpk'^{(j)}_F)\) from the fog server. Finally, the communication overhead caused by engaging in the PoW protocol between the user and the fog server is \(C_{hash}*O(h)\). Therefore the total communication overheads of the scheme (Koo and Hur 2018) are \(C_{ini}+C_{sub}=|C_F|+2|h_z(F)|+|fpk^{(1)}_F|+|fpk'^{(1)}_F|+ |upk^{(1)}_{i,F}|+|fuk^{(1)}_F|+|upk^{(j)}_{S,F}|+|fpk^{(j)}_F|+|fpk'^{(j)}_F|+C_{hash}*O(h) = 7936\) bits.

In the scheme (Jiang et al. 2017), the initial user needs to send the data tags \(\tau _i^1\) and \(\tau _i^2\) to the cloud server, the cloud server sends a random number \(r_*\) back to the user, and the user then computes the ciphertext \(C_i\) and \(b_i\) and sends them to the cloud server. A subsequent user also sends the data tags \(\tau _i^1\) and \(\tau _i^2\) to the cloud server, receives the random number \(r_*\), and sends back only \(b_i\). The communication overheads of this scheme are therefore \(2|\tau _i^1|+2|\tau _i^2|+2|r_*|+|C_i|+2|b_i| = 5184\) bits.

Figure 8 shows that our scheme has the lowest communication overheads of the three schemes, mainly because our proof of ownership protocol is embedded into the deduplication process, which reduces unnecessary communication interactions between the user and the server.

Fig. 8 Communication overheads

6.4 Storage overheads

This part analyzes the storage overheads of our scheme and of the schemes (Koo and Hur 2018; Jiang et al. 2017). In all three schemes, the user only needs to save a symmetric encryption key, so we do not consider the storage overheads at the user side and only analyze those on the server side.

In our scheme, the fog server stores a tri-tuple \((H(\tau _i) , \sigma _i , R_i)\) for each data item, and the cloud server stores a key item \((H(\tau _x)^i_{min} , H(\tau _y)^i_{max})\) for each fog server together with the ciphertext \(C_i\). The storage overheads (including the fog server and the cloud server) for one data item are therefore \(|H(\tau _i)|+|R_i|+|\sigma _i|+|C_i|+|H(\tau _x)^i_{min}|+ |H(\tau _y)^i_{max}| = 1440\) bits. Furthermore, the total storage overheads of our scheme are proportional to the number n of uploaded data items, i.e., \((1440*n)\) bits.
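Assuming 256-bit hash values and ciphertext blocks and a 160-bit \(R_i\), consistent with the parameter sizes used in Sect. 6.3, the per-item total can be reproduced as:

\[
|H(\tau _i)|+|R_i|+|\sigma _i|+|C_i|+|H(\tau _x)^i_{min}|+|H(\tau _y)^i_{max}| = 256 + 160 + 256 + 256 + 256 + 256 = 1440\ \text{bits}.
\]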

In the scheme (Koo and Hur 2018), the fog server stores an entry table for each data item together with the ciphertext, so its storage overheads are \(|h_z(F)|+|fpk^{(j)}_F|+|fpk'^{(j)}_F|+|upk^{(j)}_{i,F}|+|MT_{C_F}|+|C^{(i)}_F| = 3712\) bits. The cloud server also stores an entry table and the ciphertext, so the storage overheads at the cloud server are \(|h_z(F)|+|fpk^{(j)}_F|+|fpk'^{(j)}_F|+|upk^{(j)}_F|+|fuk^{(j)}_F|+|C^{(j)}_F| = 4608\) bits. The total storage overheads of the scheme (Koo and Hur 2018) are proportional to the number n of uploaded data items, i.e., \((8320*n)\) bits.

In the scheme (Jiang et al. 2017), for each uploaded data item the cloud server stores the two tags \(\tau _i^1\) and \(\tau _i^2\), the ciphertext \(C_i\) and the corresponding \(S_i\), so the storage overhead of the server is \(|\tau _i^1|+|\tau _i^2|+|C_i|+|S_i| = 2560\) bits. Similarly, the storage overhead of the scheme (Jiang et al. 2017) is proportional to the number n of uploaded data items, i.e., \((2560*n)\) bits.

Figure 9 shows that our scheme is significantly better than the other two in terms of storage overhead. This is because the other two schemes store too much information on the server side for each data item, so the storage overhead grows quickly as the amount of data increases. In contrast, our scheme uses the new decentralized deduplication structure, which not only improves checking efficiency but also reduces server-side storage overheads.

Fig. 9 Storage overheads

7 Conclusion

Fog computing, a new-generation computing paradigm, is deployed between cloud servers and users. In this paper, a secure and efficient big data deduplication scheme in fog computing is proposed. The proposed scheme not only realizes efficient deduplication of redundant data in fog computing, but also has higher efficiency than the existing schemes. When the cloud server searches for duplicate data, it traverses only a single fog server instead of all of them. Furthermore, the proposed scheme can accurately verify whether a user actually owns the entire original data.

In future work, we will focus on similarity-based deduplication in fog computing. Exploiting the repetition rate of similar data to improve the efficiency of secure deduplication can not only reduce the key extraction time but also reduce the computational overhead.