A similarity-aware encrypted deduplication scheme with flexible access control in the cloud

https://doi.org/10.1016/j.future.2017.10.014

Highlights

  • We analyze the key challenges facing existing encrypted deduplication schemes.

  • We present a similarity-aware message-locked encryption algorithm called EDedup.

  • We extend EDedup to support flexible access control with revocation.

  • Evaluations and security analysis demonstrate the efficiency and efficacy of EDedup.

Abstract

Data deduplication is widely used in the cloud to reduce storage space. To protect data security, users encrypt data with message-locked encryption (MLE), which enables deduplication over ciphertexts. However, existing secure deduplication schemes are vulnerable to brute-force attacks and fail to support flexible access control. Moreover, chunk-level MLE key generation and sharing raise potential privacy issues and incur heavy computation overheads.

We propose EDedup, a similarity-aware encrypted deduplication scheme that supports flexible access control with revocation. Specifically, EDedup groups files into segments and performs server-aided MLE at the segment level, exploiting similarity via a representative hash (e.g., the min-hash) to reduce computation overheads. This approach, however, exposes a new attack in which an attacker obtains keys by guessing the representative hash. EDedup therefore combines source-based similar-segment detection and target-based duplicate-chunk checking to resist attacks and guarantee deduplication efficiency. Furthermore, EDedup generates message-derived file keys for duplicate files to manage metadata. EDedup encrypts file keys via proxy-based attribute-based encryption, which reduces metadata storage overheads and implements flexible access control with revocation. Evaluation results demonstrate that EDedup improves the speed of MLE by up to 10.9X and 0.36X compared with DupLESS-chunk and SecDep, respectively. EDedup reduces metadata storage overheads by 39.9%–65.7% relative to REED.

Introduction

The ever-increasing volume of data poses a scalability challenge for storage management. IDC predicts that digital data will reach 44 ZB in 2020 [1]. The development of cloud computing has shown that on-demand resource provisioning ensures optimal resource allocation [2]. Cisco predicts that 86% of global data center workloads will be processed in the cloud by 2020 [3]. Microsoft studies [4] have demonstrated that more than 50% of the data in file systems, and up to 90%–95% in backup applications, is duplicate. Deduplication, which eliminates redundant data by keeping only one physical copy, has been implemented at the file or chunk level. For example, Dropbox [5], Wuala [6] and Bitcasa [7] have adopted deduplication to save storage cost.
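As an illustration of chunk-level deduplication, the following minimal Python sketch keeps one physical copy of each distinct chunk and records a per-file recipe of fingerprints. It is not the paper's implementation: the `ChunkStore` class, its method names, and the 4-byte fixed-size chunking are hypothetical choices made to keep the example short, and production systems typically use content-defined chunking instead.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> the single stored copy of a chunk

    def put_file(self, data: bytes, chunk_size: int = 4):
        """Split data into fixed-size chunks and return the file recipe."""
        recipe = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # store only unseen chunks
            recipe.append(fp)
        return recipe

    def get_file(self, recipe):
        """Rebuild a file from its recipe of chunk fingerprints."""
        return b"".join(self.chunks[fp] for fp in recipe)


store = ChunkStore()
recipe_a = store.put_file(b"AAAABBBBCCCC")
recipe_b = store.put_file(b"AAAABBBBDDDD")  # shares two chunks with file A

logical_bytes = 24  # two 12-byte files were written
physical_bytes = sum(len(c) for c in store.chunks.values())
print(logical_bytes, physical_bytes)  # 24 16
```

The two files share the chunks `AAAA` and `BBBB`, so only four distinct chunks are stored even though six were written.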

Although deduplication reduces maintenance cost, it raises security concerns for cloud systems [8]. For example, application data from specific cloud instances are sensitive to operator errors and software bugs [9]. Cloud providers are not fully trusted, and inside attackers may steal sensitive data. To protect data confidentiality, users usually encrypt data with their own keys, which yields different ciphertexts for identical data. As a result, duplicate data cannot be eliminated by deduplication [10,11]. Existing encrypted deduplication schemes use message-locked encryption (MLE) [11], which generates keys from data content and thus enables deduplication over ciphertexts. However, MLE is inherently vulnerable to brute-force attacks [12,13,14] on low-min-entropy (i.e., predictable) files: because MLE is deterministic and keyless, an attacker can recover files drawn from a known set. To resist brute-force attacks, DupLESS [13] and ClearBox [15] encrypt data with keys generated with the help of a key server via oblivious pseudo-random function (OPRF) protocols (e.g., RSA-OPRF [13] and BLS-OPRF [15]). However, the process of generating, distributing and maintaining MLE keys for a deduplication system with access control faces the following challenges.

Heavy time consumption & potential privacy issues. The time consumed by chunk-level key generation grows with the number of chunks and small files, because the oblivious PRF protocol used in DupLESS [13] and SecDep [14] is time-consuming. File systems and backup datasets contain a large number of small and similar files [16,17]. We therefore group files into segments and perform server-aided MLE, using a representative hash (e.g., the min-hash) to generate chunk keys, which reduces time consumption. However, this approach raises potential privacy issues. Moreover, it incurs storage overheads if the representative hash changes.
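The saving from segment-level key generation can be sketched as follows. This is a simplified illustration under stated assumptions, not EDedup's algorithm: the `KeyServer` class uses a plain HMAC as a stand-in for the oblivious PRF protocol (so, unlike a real OPRF, the server sees the hash), and deriving each chunk key by hashing the segment secret together with the chunk fingerprint is a hypothetical choice, since the text here does not specify the derivation.

```python
import hashlib
import hmac


class KeyServer:
    """Stand-in key server. HMAC replaces the oblivious PRF (assumption)."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self.requests = 0  # counts round trips to the key server

    def derive(self, value: bytes) -> bytes:
        self.requests += 1
        return hmac.new(self._secret, value, hashlib.sha256).digest()


def fingerprint(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()


def representative_hash(chunks) -> bytes:
    """Min-hash of a segment: the smallest chunk fingerprint."""
    return min(fingerprint(c) for c in chunks)


def segment_chunk_keys(chunks, server: KeyServer) -> dict:
    """One key-server request per segment; chunk keys are derived locally."""
    segment_secret = server.derive(representative_hash(chunks))
    # Hypothetical derivation: hash(segment secret || chunk fingerprint).
    return {c: hashlib.sha256(segment_secret + fingerprint(c)).digest()
            for c in chunks}


def per_chunk_keys(chunks, server: KeyServer) -> dict:
    """Baseline: one key-server request for every chunk."""
    return {c: server.derive(fingerprint(c)) for c in chunks}


segment = [f"chunk-{i}".encode() for i in range(8)]
segment_server = KeyServer(b"demo-secret")
chunk_server = KeyServer(b"demo-secret")

segment_keys = segment_chunk_keys(segment, segment_server)
chunk_keys = per_chunk_keys(segment, chunk_server)
print(segment_server.requests, chunk_server.requests)  # 1 8
```

Two segments obtain the same segment secret only when their min-hashes coincide, which for min-wise hashing happens with probability equal to the Jaccard similarity of their chunk sets. The sketch also makes the weakness visible: anyone who can guess the representative hash can request the same segment secret.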

Inflexible access control. Cross-user deduplication enables data sharing, which brings new challenges for access control and key management of outsourced data. Cloud servers are operated by commercial providers that are likely to be outside the protected domain. For example, REED [18] applies attribute-based encryption [19] to enable dynamic access control. REED generates file keys based on users’ RSA private key state, which makes the storage overheads of file metadata enormous. If some users are revoked, the owner is required to generate and distribute new keys, and re-encrypt data. Yan et al. [20] adopt proxy re-encryption [21] to realize access control at the file level. Proxy re-encryption is inefficient when encrypting large files due to the exponentiation computations of bilinear pairings. In short, existing solutions [18,20] incur large metadata storage overheads and do not support access control with revocation.

To overcome the aforementioned challenges, we present EDedup, an efficient encrypted deduplication scheme that supports flexible access control with revocation. Specifically, EDedup uses a similarity-aware message-locked encryption algorithm, which performs the oblivious PRF protocol [13,15] at the segment level to generate chunk keys. We identify a new attack in which an attacker can obtain keys by guessing the representative hash (e.g., the min-hash). EDedup therefore performs source-based similar-segment detection and target-based duplicate-chunk checking to prevent privacy leakage and guarantee deduplication efficiency. Moreover, EDedup generates secure message-derived file keys to manage file metadata (i.e., recipes and keys), which prevents the metadata storage overheads from increasing with the number of users. EDedup employs proxy-based attribute-based encryption (proxy-based CP-ABE, e.g., Piratte [22,23]) to realize fine-grained access control based on users’ attributes and access policies. EDedup supports efficient revocation that requires neither distributing new keys to unrevoked users nor re-encrypting all data.

This paper makes the following contributions.

  • We identify a new attack caused by leveraging similarity and directly using the min-hash for key generation. To resist the attack and ensure the performance of deduplication, we present an efficient similarity-aware message-locked encryption algorithm that combines source-based similar-segment detection and target-based duplicate-chunk checking.

  • We extend EDedup to support flexible access control with revocation. To ensure data privacy and reduce metadata storage overheads, EDedup utilizes random message-derived file keys and proxy-based attribute-based encryption for key and metadata management.

  • Security analysis demonstrates that EDedup is secure under the proposed threat model. We implement a prototype of EDedup. Experimental results based on real-world datasets show that EDedup improves the speed of MLE by 7.9X-10.9X and 0.26X-0.36X compared with state-of-the-art approaches DupLESS-chunk and SecDep respectively. EDedup reduces metadata storage overheads by 39.9%–65.7% relative to REED.

The remainder of the paper proceeds as follows. Section 2 presents the background and problems. Section 3 describes the system model and goals. Section 4 presents the design and implementation of EDedup. Section 5 presents the security analysis. Section 6 presents experimental results on different real-world datasets. Section 7 discusses the application scenarios and design innovation. Finally, we draw conclusions in Section 9.

Section snippets

Preliminaries

In this subsection, we formally define the cryptographic primitives used in our encrypted deduplication system.

Message-locked encryption. Bellare et al. [11,24] formalize message-locked encryption (MLE), where keys are generated by the hash value of messages. Messages are encrypted via symmetric encryption (e.g., AES). Specifically, a user generates the key $K$ via $K \leftarrow \mathrm{Hash}(M)$ and encrypts $M$ by $C_M \leftarrow \mathrm{Encry}_{aes}(K, M)$. The
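A minimal sketch of this construction, and of why it is open to brute-force attacks on predictable files, is shown below. It rests on explicit assumptions: a SHA-256 counter-mode keystream stands in for AES so that the example needs only the Python standard library (it is not a vetted cipher), and the function names `mle_keygen`, `mle_encrypt`, `mle_decrypt`, and `brute_force` are hypothetical.

```python
import hashlib


def _keystream(key: bytes, length: int) -> bytes:
    """SHA-256 in counter mode; a stand-in for AES, not a vetted cipher."""
    blocks = []
    counter = 0
    while 32 * len(blocks) < length:
        blocks.append(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return b"".join(blocks)[:length]


def mle_keygen(message: bytes) -> bytes:
    """K <- Hash(M): the key depends only on the message content."""
    return hashlib.sha256(message).digest()


def mle_encrypt(key: bytes, message: bytes) -> bytes:
    """C_M <- Enc(K, M): deterministic, so equal messages give equal ciphertexts."""
    return bytes(m ^ s for m, s in zip(message, _keystream(key, len(message))))


def mle_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """XOR with the same keystream inverts the encryption."""
    return bytes(c ^ s for c, s in zip(ciphertext, _keystream(key, len(ciphertext))))


def brute_force(ciphertext: bytes, candidates):
    """Offline attack: re-encrypt every guess and compare with the target."""
    for guess in candidates:
        if mle_encrypt(mle_keygen(guess), guess) == ciphertext:
            return guess
    return None


# Two users holding the same file independently produce the same ciphertext,
# which is what lets the server deduplicate without seeing the plaintext.
alice_ct = mle_encrypt(mle_keygen(b"quarterly report v1"), b"quarterly report v1")
bob_key = mle_keygen(b"quarterly report v1")
bob_ct = mle_encrypt(bob_key, b"quarterly report v1")
print(alice_ct == bob_ct)  # True

# A predictable file falls to enumeration of a small candidate set.
target_ct = mle_encrypt(mle_keygen(b"salary: 50042 USD"), b"salary: 50042 USD")
candidates = [f"salary: {s} USD".encode() for s in range(50000, 50100)]
recovered = brute_force(target_ct, candidates)
print(recovered)  # b'salary: 50042 USD'
```

The attack needs no secret because every step is keyless. Server-aided schemes counter it by mixing a key-server secret into key generation, so guesses can no longer be checked offline.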

System architecture

Our setting is an enterprise network consisting of a group of affiliated clients who use the cloud as a storage service. Deduplication can be applied frequently in this setting, which greatly reduces storage space. Fig. 1 presents the system architecture of EDedup, which consists of three entities: the client (users), the key server (KS), and the cloud storage provider (CSP). Before a client interacts with the CSP or the KS, the user's password and credentials are verified. Data transmission is ensured by the well-known

Overview

In this subsection, we propose a similarity-aware message-locked encryption algorithm and flexible access control with efficient revocation (see Section 4.2). Then we describe the implementation details (see Section 4.3). The notations used in this paper are listed in Table 2.

We briefly introduce the cryptographic primitives of EDedup as follows. We propose a six-tuple of functions to realize the similarity-aware message-locked encryption with flexible access control. In general, a user leverages

Security analysis

EDedup is designed to build a deduplication-based system in the cloud. We analyze its security in terms of two aspects: data confidentiality and privacy of access control. The security of EDedup is ensured by the oblivious PRF protocol, symmetric encryption, and proxy-based attribute-based encryption [22]. We show that EDedup is secure with respect to the potential risks described in Section 3.2.

Proposition 1

The attackers cannot break data privacy, even if they generate multiple similar files and

Performance evaluation

Discussions

Application Scenarios. EDedup can be applied in both cloud and non-cloud settings. EDedup employs a sequential model for large-scale datasets. All files must be processed by the following modules: segmenting, key generation, encryption, uploading, and duplicate checking. To illustrate the difference between cloud and non-cloud storage, Wills et al. [43] propose organizational sustainability modeling (OSM), which defines the actual execution time related to Cloud and non-Cloud. In addition, EDedup

Related work

Privacy risks. Data privacy and confidentiality have become concerns for encrypted deduplication storage systems. Harnik et al. [50] point out that cross-user client-side deduplication can be exploited as a side channel, and they propose a randomized solution to resist side-channel attacks. Mulazzani et al. [51] implement a hash manipulation attack on Dropbox to access unauthorized files. In addition, Halevi et al. [52] introduce “proof of ownership” (PoW) to convince the server that it indeed owns

Conclusions & future work

In this paper, we identify a new attack caused by leveraging data similarity. We present EDedup, a similarity-aware encrypted deduplication scheme that performs server-aided MLE at the segment level to reduce computation overheads. EDedup combines source-based similar-segment detection and target-based duplicate-chunk checking to prevent privacy leakage. In addition, EDedup generates random message-derived file keys for metadata management, and realizes access control with revocation by adopting

Acknowledgments

The authors are grateful to the anonymous reviewers for their valuable comments and feedback. The work was partly supported by 863 Project 2015AA016701, and 2015AA015301; NSFC No. 61232004, 61502190, and No. 6172222; Shenzhen Research Funding of Science and Technology - Fundamental Research (Free exploration) JCYJ20170307172447622.

Yukun Zhou is currently a Ph.D. student majoring in computer architecture at Huazhong University of Science and Technology, Wuhan, China. His current research interests include data deduplication, cloud backup systems, and storage security. He has published several papers in major journals and conferences, including IEEE TC, MSST, INFOCOM, IPDPS, and IFIP Performance.

References (66)

  • Wuala, 2017,...
  • Bitcasa, 2017,...
  • S. Yu, C. Wang, K. Ren, W. Lou, Achieving secure scalable and fine-grained data access control in cloud computing, in:...
  • K.P.N. Puttaswamy, C. Kruegel, B.Y. Zhao, Silverline: Toward data confidentiality in storage-intensive cloud...
  • M.W. Storer, K. Greenan, D.D. Long, E.L. Miller, Secure data deduplication, in: Proceedings of ACM StorageSS,...
  • M. Bellare et al., Message-locked encryption and secure deduplication
  • Hash of plaintext as key? 2016,...
  • S. Keelveedhi, M. Bellare, T. Ristenpart, DupLESS: Server-aided encryption for deduplicated storage, in: Proceedings of USENIX...
  • Y. Zhou, D. Feng, W. Xia, M. Fu, F. Huang, Y. Zhang, C. Li, SecDep: A user-aware efficient fine-grained secure...
  • F. Armknecht et al., Transparent data deduplication in the cloud
  • N. Agrawal et al., A five-year study of file-system metadata, ACM Trans. Storage (TOS) (2007)
  • D. Bhagwat, K. Eshghi, D.D.E. Long, M. Lillibridge, Extreme binning: Scalable parallel deduplication for chunk-based...
  • J. Li, C. Qin, P.P. Lee, J. Li, Rekeying for encrypted deduplication storage, in: Proceedings of DSN, Toulouse,...
  • V. Goyal, O. Pandey, A. Sahai, B. Waters, Attribute-based encryption for fine-grained access control of encrypted data,...
  • Z. Yan et al., Deduplication on encrypted big data in cloud, IEEE Trans. Big Data (2016)
  • G. Ateniese et al., Improved proxy re-encryption schemes with applications to secure distributed storage, ACM Trans. Inf. Syst. Secur. (2006)
  • Piratte project, 2012,...
  • S. Jahid, P. Mittal, N. Borisov, EASiER: Encryption-based access control in social networks with efficient revocation, in:...
  • M. Bellare et al., Interactive message-locked encryption and secure deduplication
  • S. Jahid, N. Borisov, Piratte: Proxy-based immediate revocation of attribute-based encryption, CoRR 2012, pp....
  • M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, P. Camble, Sparse indexing: Large scale inline...
  • M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang, Y. Tan, Design tradeoffs for data deduplication performance...
  • B. Zhu, K. Li, H. Patterson, Avoiding the disk bottleneck in the data domain deduplication file system, in: Proceedings...

Dan Feng received the B.E., M.E., and Ph.D. degrees in Computer Science and Technology in 1991, 1994, and 1997, respectively, from Huazhong University of Science and Technology (HUST), China. She is a professor and vice dean of the School of Computer Science and Technology, HUST. Her research interests include computer architecture, massive storage systems, and parallel file systems. She has more than 100 publications in major journals and international conferences, including IEEE-TC, IEEE-TPDS, ACM-TOS, JCST, FAST, USENIX ATC, ICDCS, HPDC, SC, ICS, IPDPS, and ICPP. She serves on the program committees of multiple international conferences, including SC 2011, 2013 and MSST 2012. She is a member of IEEE and a member of ACM.

Yu Hua received the B.E. and Ph.D. degrees in computer science from Wuhan University, China, in 2001 and 2005, respectively. He is an associate professor at the Huazhong University of Science and Technology, China. His research interests include computer architecture, cloud computing, and network storage. He has more than 60 papers to his credit in major journals and international conferences, including IEEE Transactions on Computers (TC), IEEE Transactions on Parallel and Distributed Systems (TPDS), USENIX ATC, USENIX FAST, INFOCOM, SC, ICDCS, and ICPP. He has been on the program committees of multiple international conferences, including INFOCOM, ICDCS, ICPP and IWQoS. He is a senior member of the IEEE, and a member of ACM and USENIX.

Wen Xia received his Ph.D. degree in Computer Science from Huazhong University of Science and Technology (HUST), China in 2014. He has published over 30 papers in major journals and conferences, including Proceedings of the IEEE (PIEEE), IEEE Transactions on Computers (TC), IEEE Transactions on Parallel and Distributed Systems (TPDS), USENIX ATC, FAST, INFOCOM, MSST, Performance, IPDPS, and HotStorage.

Min Fu is currently a Ph.D. student majoring in computer architecture at Huazhong University of Science and Technology, Wuhan, China. His current research interests include data deduplication, backup and archival storage systems, and key-value stores. He has published several papers in major conferences, including USENIX ATC and FAST.

Fangting Huang received the B.E. degree in software engineering from Sun Yat-sen University (SYSU), China, in 2010. She is currently working toward the Ph.D. degree in computer architecture at the Huazhong University of Science and Technology (HUST). Her research interests include computer architecture and storage systems.

Yucheng Zhang is a Ph.D. student at Wuhan National Laboratory for Optoelectronics, HUST. His interests include data deduplication, chunking algorithms, and storage systems. He has published several papers in major journals and conferences, including IEEE TC and INFOCOM.
