Elsevier

Pervasive and Mobile Computing

Volume 41, October 2017, Pages 231-242

Fast track article
Secure similarity-based cloud data deduplication in Ubiquitous city

https://doi.org/10.1016/j.pmcj.2017.03.010

Abstract

Ubiquitous city, an appealing vision of the future urban environment, enables citizens to access any infrastructure and enjoy high-quality urban services by integrating information and communication technologies into urban management. However, it inevitably generates a huge amount of data, which makes efficiently managing the ever-increasing data while preserving data privacy a challenging task. To cope with this issue, secure data deduplication has attracted considerable interest from both the academic and industrial communities. It can reduce storage cost by eliminating duplicate data copies while preserving data privacy. Although message-locked encryption has been widely adopted to perform secure cross-client deduplication, it introduces a large amount of additional metadata on both the client and cloud sides. Recently, researchers proposed an extension of message-locked encryption, named block-level message-locked encryption (BL-MLE), in which block keys are encapsulated into block tags to save metadata storage space. We argue that BL-MLE suffers from high computation overhead for block tag comparison, especially when the files are dissimilar. In this paper, we propose a novel secure similarity-based data deduplication scheme that integrates Bloom filters and content-defined chunking, which can significantly reduce the computation overhead by performing deduplication operations only for similar files. Security and efficiency evaluations show that the proposed scheme achieves the desired security goals while providing a comparable computation overhead.

Introduction

The emerging concept of the Ubiquitous city (U-city) has attracted considerable attention: the quality of urban life can be dramatically improved by integrating information and communication technology into the city. In such a scenario, citizens can access any urban service anytime and anywhere through many kinds of devices. As a result, huge amounts of data are collected from mobile devices by the central data administrator in real time [1]. Although the U-city brings huge benefits to citizens' lives, its security and privacy become a great concern. Cloud computing, as the most promising computing paradigm, is a good choice for the U-city. It can provide high-quality data services for citizens while ensuring the privacy of their data. Secure cloud data storage has been well studied in the academic community [2], [3], [4], [5], [6], [7]. As data grows dramatically in the U-city, how to efficiently manage the ever-increasing data becomes a critical challenge.

Data deduplication, a special type of data compression, has been widely adopted to save storage resources and minimize the transmission of duplicate data by eliminating duplicate data copies [8], [9], [10], [11], [12]. The traditional approach is as follows: once a given data copy has been uploaded by the first user, subsequent users holding the same copy do not need to upload it again; instead, each of them is only assigned a link to the original data copy. When data redundancy is high, deduplication greatly reduces storage space and communication overhead. However, to protect data confidentiality, users need to encrypt their data with their own keys. As a result, an identical data file shared by different users yields different ciphertexts, which makes cross-user deduplication impossible.
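The link-based approach described above can be illustrated with a minimal sketch over plaintext data. The class `DedupStore`, its method names and the use of SHA-256 as the fingerprint are assumptions made for illustration only; they are not taken from any scheme cited here.

```python
import hashlib


class DedupStore:
    """Toy plaintext deduplicating store (hypothetical, illustration only)."""

    def __init__(self):
        self.blobs = {}   # fingerprint -> stored bytes (one physical copy)
        self.links = {}   # (user, filename) -> fingerprint

    def upload(self, user, filename, data):
        """Store data once; return True if a new physical copy was written."""
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.blobs
        if is_new:
            self.blobs[fingerprint] = data
        # Every uploader receives a link, whether or not bytes were stored.
        self.links[(user, filename)] = fingerprint
        return is_new

    def download(self, user, filename):
        """Follow the user's link to the single stored copy."""
        return self.blobs[self.links[(user, filename)]]


store = DedupStore()
report = b"quarterly traffic report"
first = store.upload("alice", "report.txt", report)   # writes the copy
second = store.upload("bob", "copy.txt", report)      # only adds a link
```

If each user first encrypted `report` under a private key, the two uploads would have different fingerprints and both copies would be stored, which is the problem the following paragraphs address.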

As a promising approach, convergent encryption (CE) [9] was proposed to encrypt and decrypt data at the file level with a convergent key that is generated from the file content itself. Bellare et al. [8] proposed a new cryptographic primitive called message-locked encryption (MLE), which can be viewed as a generalization of CE. Subsequently, a randomized convergent encryption scheme was given, which can efficiently accomplish the necessary operations (i.e., key generation, message encryption and tag production). However, the above solutions require a large amount of metadata to perform (encrypted) data deduplication. To tackle this problem, Chen et al. [13] recently proposed a more efficient deduplication approach named block-level message-locked encryption (BL-MLE), which achieves dual-level (file-level and block-level) deduplication. In BL-MLE, however, for two different uploaded files the cloud server needs to perform a block deduplication check for each block in sequence. For example, if each file consists of $n$ blocks, at most $n^2$ block checks are needed. Since each block check requires a pairing computation (i.e., up to $n^2$ pairing computations in total), the computation overhead is very heavy for large data files. What is worse, the fixed-size chunking algorithm adopted in BL-MLE reduces the efficiency of deduplication in practice because of the boundary-shift problem [14].
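To make the idea of a content-derived key concrete, the following is a minimal sketch of convergent encryption in its common textbook form: key = H(message), ciphertext = E(key, message), tag = H(ciphertext). The function names are hypothetical, and the cipher is a toy SHA-256 counter-mode keystream chosen only so that the example runs with the standard library; it is an assumption of this sketch and not a secure or recommended construction.

```python
import hashlib


def _keystream(key, length):
    """Toy keystream: SHA-256 in counter mode (stand-in for a real cipher)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def ce_encrypt(message):
    """Derive the key from the message itself, then encrypt and tag."""
    key = hashlib.sha256(message).digest()
    stream = _keystream(key, len(message))
    ciphertext = bytes(m ^ s for m, s in zip(message, stream))
    tag = hashlib.sha256(ciphertext).hexdigest()   # used for duplicate checks
    return key, ciphertext, tag


def ce_decrypt(key, ciphertext):
    """XOR with the same keystream recovers the message."""
    stream = _keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


data = b"same file held by two users"
key_a, ct_a, tag_a = ce_encrypt(data)   # user A
key_b, ct_b, tag_b = ce_encrypt(data)   # user B, independently
other_tag = ce_encrypt(b"a different file")[2]
```

Because the key depends only on the content, both users obtain the same ciphertext and tag without sharing any secret, so the server can deduplicate by comparing tags.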

To alleviate these weaknesses, we provide a new secure deduplication scheme that uses file similarity to improve the efficiency of the deduplication process. Specifically, the user first uploads a file tag for file-level deduplication. If no duplicate exists, the cloud server detects whether a similar file exists by comparing the uploaded file tag with the file tags in its database. The server then performs block-level deduplication only between two similar files. We also construct a content-defined file tag generation algorithm, so that similarity can be measured with the file tag alone instead of the full text of the file. Meanwhile, we employ a variable-sized partition technology in our chunking process. Hence, our scheme reaches a good balance between deduplication rate and efficiency.
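The benefit of variable-sized partitioning over fixed-size chunking can be seen in a small experiment. The sketch below is only loosely modelled on the asymmetric-extremum idea of [14]: comparing single bytes, the window of 32 and the function names are all assumptions of this sketch, and it should not be read as the chunker used in the proposed scheme.

```python
import hashlib


def fixed_chunks(data, size):
    """Cut data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def extremum_chunks(data, window):
    """Cut `window` bytes after a local maximum that nothing later exceeds."""
    chunks, start, n = [], 0, len(data)
    while start < n:
        max_val, max_pos = data[start], start
        cut = n                       # default: the rest is the last chunk
        i = start + 1
        while i < n:
            if data[i] <= max_val:
                if i == max_pos + window:
                    cut = i + 1       # boundary right after position i
                    break
            else:
                max_val, max_pos = data[i], i
            i += 1
        chunks.append(data[start:cut])
        start = cut
    return chunks


# Deterministic pseudo-random test data built from SHA-256 outputs.
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
shifted = b"\x00" + original          # one byte inserted at the front

shared_fixed = set(fixed_chunks(original, 64)) & set(fixed_chunks(shifted, 64))
shared_cdc = set(extremum_chunks(original, 32)) & set(extremum_chunks(shifted, 32))
```

With fixed-size chunks, the single inserted byte shifts every boundary, so no chunk of the modified file matches a stored one. With content-defined boundaries, the chunker re-synchronizes and later chunks can still be deduplicated.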

In this paper, we further study secure deduplication technology. Our contributions are threefold:

  • We propose a novel secure similarity-based data deduplication (SimDedup) scheme in which block-level deduplication is performed only between similar files. Thus, a good balance between the computation overhead and the deduplication rate can be reached.

  • To compute the similarity of files efficiently, we construct a novel content-defined tag generation algorithm for secure deduplication, which detects the similarity of files from the file tags alone rather than from a similarity coefficient computed over the full text.

  • Security analysis shows that SimDedup is secure under the proposed security model, while it significantly reduces the computation overhead of deduplication.
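The upload flow implied by these contributions (file-level check, then a similarity check on file tags, then block-level deduplication only against a similar file) can be sketched as follows. This is a plaintext illustration under stated assumptions, not the construction of this paper: the Bloom-filter parameters `BITS` and `HASHES`, the Jaccard-style `similarity` measure, the `THRESHOLD` value and all function names are hypothetical, and encryption is omitted entirely.

```python
import hashlib

BITS = 1024       # assumed Bloom filter size
HASHES = 3        # assumed number of hash functions per chunk
THRESHOLD = 0.5   # assumed similarity cut-off


def bloom_tag(chunks):
    """Fold chunk fingerprints into an integer used as a Bloom-filter bit set."""
    tag = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        for k in range(HASHES):
            index = int.from_bytes(digest[4 * k:4 * k + 4], "big") % BITS
            tag |= 1 << index
    return tag


def similarity(tag_a, tag_b):
    """Jaccard-style overlap of the set bits of two tags."""
    union = bin(tag_a | tag_b).count("1")
    return bin(tag_a & tag_b).count("1") / union if union else 1.0


def upload(server, name, chunks):
    """Return (decision, number of blocks that must actually be stored)."""
    file_hash = hashlib.sha256(b"".join(chunks)).hexdigest()
    if any(e["file_hash"] == file_hash for e in server.values()):
        return "file-duplicate", 0
    tag = bloom_tag(chunks)
    blocks = {hashlib.sha256(c).hexdigest() for c in chunks}
    best = max(server.values(),
               key=lambda e: similarity(tag, e["tag"]), default=None)
    if best is not None and similarity(tag, best["tag"]) >= THRESHOLD:
        decision, new_blocks = "block-dedup", blocks - best["blocks"]
    else:
        # Dissimilar file: skip block comparison and store everything.
        decision, new_blocks = "stored-whole", blocks
    server[name] = {"file_hash": file_hash, "tag": tag, "blocks": blocks}
    return decision, len(new_blocks)


server = {}
base = [b"chunk-%d" % i for i in range(20)]
similar = base[:18] + [b"new-a", b"new-b"]
different = [b"other-%d" % i for i in range(20)]
results = [upload(server, "f1", base), upload(server, "f2", base),
           upload(server, "f3", similar), upload(server, "f4", different)]
```

In this toy run, block-level comparison is carried out only for the file that resembles a stored one, while the dissimilar file skips it, which is the source of the computational saving claimed above.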

Douceur et al. [9] first introduced the notion of convergent encryption, which ensures data confidentiality while allowing deduplication. Specifically, each data file is encrypted with a so-called convergent key that is derived from the cryptographic hash value of the file contents, so that identical data copies generate the same ciphertext. Bellare et al. [8] proposed a novel cryptographic primitive called Message-Locked Encryption (MLE), for which the formal definition and security model of deduplication are given. To strengthen the security of R-MLE, Abadi et al. proposed a modified primitive named R-MLE2 [10], which is considered a fully randomized scheme. Bellare et al. [15] proposed an improved deduplication scheme, DupLESS, which resists online brute-force attacks by running an oblivious pseudo-random function (OPRF). In DupLESS, users generate each key with the aid of a secret parameter produced by a third-party key server. To prevent the data from being leaked to the third-party server, they exploited the RSA blind signature proposed in [16]. Bellare and Keelveedhi extended their prior work and proposed a new primitive named Interactive Message-Locked Encryption (iMLE) [17]. In their model, security is guaranteed whether or not the message is correlated with the public parameters.
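The role of the blind signature in server-aided key generation can be illustrated with the generic RSA blind signature. The sketch below is not the DupLESS protocol itself: the toy key (n = 3233, e = 17, d = 2753), the helper names and the final key-derivation step are assumptions for illustration, and the modulus is far too small for any real use.

```python
import hashlib
from math import gcd

N, E, D = 3233, 17, 2753   # toy textbook RSA key (p = 61, q = 53)


def hash_to_int(message):
    """Map the message to an integer modulo N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def blind(message, r):
    """Client: hide H(m) behind a random factor r with gcd(r, N) == 1."""
    assert gcd(r, N) == 1
    return (hash_to_int(message) * pow(r, E, N)) % N


def sign_blinded(blinded):
    """Key server: sign without learning H(m); uses the private exponent."""
    return pow(blinded, D, N)


def unblind(blind_sig, r):
    """Client: remove r (modular inverse via pow needs Python 3.8+)."""
    return (blind_sig * pow(r, -1, N)) % N


def derive_key(signature):
    """Hypothetical final step: hash the signature into a symmetric key."""
    return hashlib.sha256(signature.to_bytes(2, "big")).digest()


msg = b"file contents"
sig_1 = unblind(sign_blinded(blind(msg, 7)), 7)     # first client, r = 7
sig_2 = unblind(sign_blinded(blind(msg, 11)), 11)   # second client, r = 11
key_1, key_2 = derive_key(sig_1), derive_key(sig_2)
```

Both clients end up with the same key for the same content even though they used different blinding factors, and the key server sees only blinded values, which is why the data itself is not exposed to it.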

In order to counter different types of attacks, several defensive measures have been proposed in recent years. To resist the side-channel attacks in cross-user deduplication identified by Harnik et al. [18], Stanek et al. [19] proposed a deduplication scheme that differentiates data according to its popularity. Against the duplicate faking attacks defined in [20], Wang et al. [11] proposed a data deduplication scheme named TrDup, in which user traceability is supported by incorporating traceable signatures.

The rest of the paper is organized as follows. We briefly introduce some preliminaries in Section  2. In Section  3, we present the system model and security model for the proposed similarity-based data deduplication scheme. The proposed similarity-based data deduplication scheme and its security analysis are presented in Section  4. The performance evaluation is given in Section  5. Finally, we conclude the paper in Section  6.

Section snippets

Asymmetric extremum content-defined chunking

Content-Defined Chunking (CDC) is a variable-sized partition technology, in contrast to the fixed-sized partition technology used in most secure deduplication schemes. In a CDC scheme, chunk boundaries are declared depending only on local content. Asymmetric Extremum Content-Defined Chunking (AE-CDC), regarded as the state-of-the-art CDC scheme, was proposed by Zhang et al. [14]. In their scheme, bytes are treated as digits, as in previous CDC schemes [21], [22]. When a user inputs a

Formal definition

In our SimDedup scheme, the user and the cloud server are the main participants. More specifically, the user intends to upload data to the cloud server, and the server maintains the uploaded data. As depicted in Fig. 3, the storage server in our scheme is deployed in a layered structure. The outside layer contains a collection of file tags, and the inside layer contains collections of block tags. Since similar files often have similar size, we distinguish inside-layer

SimDedup—the proposed construction

In this section, we present the proposed similarity-based secure deduplication scheme and then give its security analysis.

Performance evaluation

We implemented a prototype of our SimDedup scheme in Python 3.4. In our prototype, all encryption modules are based on the hashlib and PyCrypto modules [30]. All experiments were performed on a server with an 8-core Intel Xeon(R) CPU E5-1620 v3 @ 3.50 GHz running Ubuntu 14.04 LTS.

Conclusion

Since ubiquitous computing is usually deployed on cloud servers, secure cloud data deduplication is one of the core issues in the ubiquitous city. In this paper, we propose a novel secure similarity-based cloud data deduplication scheme called SimDedup, in which deduplication is performed on similar data files, achieving a good balance between deduplication cost and deduplication rate. Furthermore, a new content-defined tag generation algorithm is given. In our scheme, file tag

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61572382), China 111 Project (No. B16037), National High Technology Research and Development Program (863 Program) of China (No. 2015AA016007), Natural Science Basic Research Plan in Shaanxi Province of China (No. 2016JZ021), Guangxi Cooperative Innovation Center of cloud computing and Big Data (No. YD16506), Guangxi Colleges and Universities Key Laboratory of cloud computing and complex systems (No. YF16101) and

References (31)

  • N. Bjørner et al., Content-dependent chunking for differential compression, the local maximum approach, J. Comput. Syst. Sci. (2010)
  • L.G. Anthopoulos et al., From digital to ubiquitous cities: Defining a common architecture for urban development
  • M.J. Atallah et al., Secure outsourcing of scientific computations, Adv. Comput. (2001)
  • X. Chen et al., New algorithms for secure outsourcing of modular exponentiations, IEEE Trans. Parallel Distrib. Syst. (2014)
  • X. Chen et al., Verifiable computation over large database with incremental updates, IEEE Trans. Comput. (2016)
  • Z. Fu et al., Towards efficient content-aware search over encrypted outsourced data in cloud
  • Z. Xia et al., A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data, IEEE Trans. Parallel Distrib. Syst. (2016)
  • Z. Fu et al., Enabling personalized search over encrypted outsourced data with efficiency improvement, IEEE Trans. Parallel Distrib. Syst. (2016)
  • M. Bellare et al., Message-locked encryption and secure deduplication
  • J.R. Douceur, A. Adya, W.J. Bolosky, D. Simon, M. Theimer, Reclaiming space from duplicate files in a serverless...
  • M. Abadi et al., Message-locked encryption for lock-dependent messages
  • J. Wang et al., A new secure data deduplication approach supporting user traceability
  • B. Zhu et al., Avoiding the disk bottleneck in the data domain deduplication file system
  • R. Chen et al., BL-MLE: block-level message-locked encryption for secure large file deduplication, IEEE Trans. Inf. Forensics Secur. (2015)
  • Y. Zhang et al., AE: an asymmetric extremum content defined chunking algorithm for fast and bandwidth-efficient data deduplication
