Elsevier

Computer Communications

Volume 82, 15 May 2016, Pages 71-82

Proof of ownership for deduplication systems: A secure, scalable, and efficient solution

https://doi.org/10.1016/j.comcom.2016.01.011

Abstract

Deduplication is a technique used to reduce the amount of storage needed by service providers. It is based on the intuition that several users may want (for different reasons) to store the same content. Hence, storing a single copy of these files would be sufficient. Albeit simple in theory, the implementation of this concept introduces many security risks. In this paper, we address the most severe one: an adversary, possessing only a fraction of the original file, or colluding with a rightful owner who leaks arbitrary portions of it, becomes able to claim possession of the entire file. The paper’s contributions are manifold: first, we review the security issues introduced by deduplication, and model related security threats; second, we introduce a novel Proof of Ownership (POW) scheme with all the features of the state-of-the-art solution and only a fraction of its overhead. We also show that the security of the proposed mechanisms relies on information-theoretical rather than computational assumptions, and propose viable optimization techniques that further improve the scheme’s performance. Finally, the quality of our proposal is supported by extensive benchmarking.

Introduction

The rapid surge in cloud service offerings has resulted in a sharp drop in prices of storage services, and in an increase in the number of customers. Through popular providers, like Amazon S3 and Microsoft Azure, and backup services, like Dropbox and Memopal, storage has indeed become a commodity. Among the reasons for the low prices, we find a strong use of multitenancy, the reliance on distributed algorithms run on top of simple hardware, and an efficient use of the storage backend thanks to compression and deduplication. Deduplication is widely adopted in practice, for instance by services such as Bitcasa [7] and Cyphertite [9].

Deduplication is the process of avoiding storing the same data multiple times. It leverages the fact that large data sets often exhibit high redundancy. Examples include common email attachments, financial records with common headers and semi-identical fields, and popular media content (such as music and videos) likely to be owned (and stored) by many users.

The benefit of deduplication can be measured using the deduplication ratio, defined as the number of bytes input to a data deduplication process divided by the number of bytes output [13]. In the case of a cloud storage service, this is the ratio between the total size of files uploaded by all clients and the total storage space used by the service. The deduplication ratio equals 1 / (1 - space reduction ratio), where the space reduction ratio is the fraction of input bytes saved. A perfect deduplication scheme will detect all duplicates; hence its space reduction ratio will be close to 1. The level of deduplication achievable depends on a number of factors. In common business settings, deduplication ratios in the range of 4:1 (.75) to 500:1 (.998) are typical. Wendt [28] suggests that a figure in the range of 10:1 (.90) to 20:1 (.95) is a realistic expectation.
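For concreteness, the relationship between the two metrics can be checked with a short computation; the byte counts below are illustrative, not measurements from the paper:

```python
# Relationship between the deduplication ratio (bytes in / bytes out)
# and the space reduction ratio (fraction of input bytes saved).

def dedup_ratio(bytes_in: int, bytes_out: int) -> float:
    return bytes_in / bytes_out

def space_reduction(bytes_in: int, bytes_out: int) -> float:
    return 1 - bytes_out / bytes_in

# A 4:1 deduplication ratio corresponds to a .75 space reduction:
print(dedup_ratio(400, 100))      # 4.0
print(space_reduction(400, 100))  # 0.75
# and, inversely, ratio = 1 / (1 - space reduction):
print(1 / (1 - 0.75))             # 4.0
```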

There are four different deduplication strategies, depending on whether deduplication happens at the client side (i.e. before the upload) or at the server side, and whether it happens at a block level or at a file level. Deduplication is most efficient when it is triggered at the client side, as it also saves upload bandwidth. For these reasons, deduplication is a critical enabler for a number of popular and successful storage services (e.g. Dropbox, Memopal) that offer cheap remote storage to the broad public by performing client-side deduplication, thus saving both the network bandwidth and the storage costs associated with processing the same content multiple times. However, new technologies introduce new vulnerabilities, and deduplication is no exception.

Harnik et al. [17] have identified a number of threats that can affect a storage system performing client-side deduplication. These threats, briefly reviewed in the following, can be turned into practical attacks by any user of the system.

A first set of attacks targets the privacy and confidentiality of users of the storage system. For instance, a user can check whether another user has already uploaded a file by trying to upload it as well, and by checking—e.g. by monitoring local network traffic—whether the upload actually takes place. This attack is particularly relevant for rare files that may reveal the identity of the user who performed the original upload. This attack, as shown in [17], can also be turned into an attack targeting the discovery of the content of a file. Suppose a document has standard headings, text and signature, contains mostly public information, and has only a small subset of private information. A malicious user can forge all possible combinations of such a document, upload them all and check for the ones that undergo deduplication.
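The content-discovery variant of this attack can be sketched against a toy deduplicating server; the `DedupServer` class, the salary field, and the candidate range are all illustrative stand-ins, not taken from [17]:

```python
import hashlib

# Toy model: the service leaks, through client-observable network
# behaviour, whether an upload was deduplicated.

class DedupServer:
    def __init__(self):
        self.store = {}

    def upload(self, data: bytes) -> bool:
        """Returns True if the upload was deduplicated (already stored)."""
        h = hashlib.sha256(data).hexdigest()
        if h in self.store:
            return True
        self.store[h] = data
        return False

server = DedupServer()

# A victim uploads a document whose only secret is a salary field.
secret_salary = 57000
server.upload(f"Contract for Alice, salary: {secret_salary}".encode())

# The attacker forges every plausible variant and watches which one
# triggers deduplication.
recovered = next(s for s in range(50000, 60000)
                 if server.upload(f"Contract for Alice, salary: {s}".encode()))
print(recovered)  # 57000
```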

A different type of attack can turn deduplication features into a covert channel. Two users who have no direct connectivity could try to use the storage system as a covert channel. For instance, to exchange a bit of information, the two users would pre-agree on two files. Then the transmitting user uploads one of the two files; the receiving user detects which one gets deduplicated and outputs either 0 (for the first file) or 1 (for the second one).
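A one-bit version of this covert channel can be sketched with the same kind of toy server; all names below are illustrative:

```python
import hashlib

# Toy one-bit covert channel through deduplication side effects.
# file_0/file_1 are pre-agreed by the two users out of band.

class DedupServer:
    def __init__(self):
        self.hashes = set()

    def upload(self, data: bytes) -> bool:
        """Returns True if the upload was deduplicated."""
        h = hashlib.sha256(data).digest()
        deduplicated = h in self.hashes
        self.hashes.add(h)
        return deduplicated

server = DedupServer()
file_0, file_1 = b"pre-agreed file zero", b"pre-agreed file one"

def send_bit(bit: int) -> None:
    # Transmitter uploads the file corresponding to the bit value.
    server.upload(file_1 if bit else file_0)

def receive_bit() -> int:
    # Receiver probes with file_1: if it deduplicates, the bit was 1.
    return 1 if server.upload(file_1) else 0

send_bit(1)
received = receive_bit()
print(received)  # 1
```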

Finally, users can abuse a storage system by turning it into a content-distribution network: users who wish to exchange large files can leverage the large bandwidth available to the storage system's servers by uploading a single copy of such a file and sharing the short token that triggers deduplication (in most cases, the hash digest of the file) with all users who wish to download the content. Real-world examples of such an approach include Dropship [27].

To remedy the security threats mentioned above, the concept of Proof of Ownership (POW) has been introduced [15]. POW schemes essentially address the root cause of the aforementioned attacks on deduplication, namely, that the proof that the client owns a given file (or block of data) is solely based on a static, short value (in most cases the hash digest of the file), whose knowledge automatically grants access to the file. Note that POW schemes only apply to the initial outsourcing of the file, where the client still keeps a copy of the file, and the server may or may not already have a copy of it. After this phase, the client is free to delete the file or keep it, whereas the server guarantees to its customers to keep at least a copy of it, together with an access control list that assigns files to users.

POW schemes are security protocols designed to allow a server to verify (with a given degree of assurance) whether a client owns a file. The probability that a malicious client engages in a successful POW run must be negligible in the security parameter, even if the malicious client knows a (relevant) portion of the target file. A POW scheme should be efficient in terms of CPU, bandwidth and I/O for both the server and all legitimate clients: in particular, POW schemes should not require the server to load the file (or large portions of it) from its back-end storage at each execution of POW.

POW schemes should additionally take into account that a user wishing to engage in a successful POW run with the server may be colluding with other users who possess the file and are willing to help circumvent the POW checks. These latter users, however, are neither assumed to always be online (i.e. they cannot answer the POW challenges on behalf of the malicious user), nor willing to exchange very large amounts of data with the malicious user. Both assumptions are arguably reasonable, as such users would have no strong incentive to help the free-riders.

Halevi et al. [15] have introduced the first practical cryptographic protocol that implements POW. Their seminal work, however, suffers from a number of shortcomings that might hinder its adoption. The first is that the scheme has extremely high I/O requirements at the client side: it either requires clients to load the entire file into memory or to perform random block accesses with an aggregate total I/O higher than the size of the file itself. Secondly, the scheme takes a heavy computational toll on the client. Thirdly, its security is admittedly based on assumptions that are hard to verify. Finally, its good performance on the server side depends strongly on setting a hard limit (64 MiB) on the number of bytes an adversary is entitled to receive from a legitimate file owner.

In this paper, inspired by [11], we propose a novel family of schemes for secure Proof of Ownership. The different constructions attain several ambitious goals: (i) their I/O and computational costs do not depend on the input file size; (ii) they are very efficient for a wide range of system parameters; (iii) they are information-theoretically secure; and, (iv) they require the server to keep a per-file state that is a negligible fraction of the input file size.

The remainder of this paper is organised as follows: Section 2 reviews the state of the art. Section 3 defines system and security models. Section 4 presents the basic scheme and three optimizations. Section 5 describes the implementation and benchmarks. Section 6 contains a discussion on the performance and potential for further optimizations, while Section 7 presents our conclusions.

Section snippets

Related work

Several deduplication schemes have been proposed by the research community [2], [20], [22], showing how deduplication allows very appealing reductions in the use of storage resources [13], [16].

Douceur et al. [12] study the problem of deduplication in a multitenant system where deduplication has to be reconciled with confidentiality. The authors propose the use of convergent encryption. Convergent encryption of a message consists of encrypting the plaintext using a deterministic (symmetric)
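The snippet above is cut short, but the core idea of convergent encryption can be sketched: derive the key deterministically from the plaintext itself, so that identical files encrypt to identical ciphertexts and remain deduplicable. The SHA-256-based toy keystream below is an illustrative stand-in for the deterministic symmetric cipher of [12], not its actual construction:

```python
import hashlib

def convergent_key(plaintext: bytes) -> bytes:
    # The key is derived from the content itself.
    return hashlib.sha256(plaintext).digest()

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # SHA-256 in counter mode as a toy keystream (illustrative only).
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def encrypt(plaintext: bytes) -> bytes:
    return keystream_xor(convergent_key(plaintext), plaintext)

m = b"same file, two users"
c1, c2 = encrypt(m), encrypt(m)
print(c1 == c2)  # True: identical plaintexts stay deduplicable
```

Decryption is the same XOR with the stored key, i.e. `keystream_xor(convergent_key(m), c1)` recovers `m` for the rightful owner.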

System model

The system is composed of two main principals, the client C and the server S. Both C and S are computing nodes with network connectivity. S has a large back-end storage facility and offers its storage capacity to C; C uploads its files and can later download them. During the upload process, S attempts to minimize the bandwidth and to optimize the use of its storage facility by determining whether the file the client is about to upload has already been uploaded by another user. If so, the file

Our scheme

In this section, we shall describe our POW solution. Our scheme consists of two separate phases: in the first phase, the server receives a file for the first time and pre-computes the responses for a number of POW challenges related to the file. Computation of POW challenges for a given file is carried out both upon receiving an upload request for a file that is not yet present at the server side, and when the stock of the previously computed challenge/responses has been depleted. The number of
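The precomputation step described above can be sketched as follows; the challenge structure (random byte positions, hashed response) is an illustrative assumption for the sketch, not the paper's actual construction:

```python
import hashlib, os, secrets

def precompute_challenges(file_data: bytes, n_challenges: int,
                          n_positions: int = 8):
    """Server-side: derive a stock of challenge/response pairs once,
    so later POW runs never re-read the file from back-end storage."""
    batch = []
    for _ in range(n_challenges):
        positions = [secrets.randbelow(len(file_data))
                     for _ in range(n_positions)]
        expected = hashlib.sha256(
            bytes(file_data[p] for p in positions)).hexdigest()
        batch.append((positions, expected))
    return batch

def prove(file_data: bytes, positions) -> str:
    # Client-side: answer a challenge using its local copy of the file.
    return hashlib.sha256(bytes(file_data[p] for p in positions)).hexdigest()

f = os.urandom(4096)                    # stand-in for the uploaded file
stock = precompute_challenges(f, n_challenges=3)
positions, expected = stock.pop()       # server consumes one pair per POW run
print(prove(f, positions) == expected)  # True: an honest client passes
```

When the stock is depleted, the server would recompute a fresh batch, which is the only time it needs to touch the file again.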

Running POW

To evaluate the effectiveness of our proposals, we have implemented both bPOW and sPOW and its two variants. The code has been developed in C++ using the OpenSSL crypto library for all cryptographic operations, and using the Saito–Matsumoto implementation [25] of the Mersenne Twister [21] as the pseudorandom number generator. The code implements both the client side and the server side of all schemes. The interactions between client and server as well as the data exchange have been virtualised

Comparison and discussion

In light of the analysis performed in the previous section, we now compare the state-of-the-art solution and our proposals. Table 1 compares bPOW, sPOW and sPOW-pHash in terms of computational cost, I/O, storage and bandwidth requirements; we omitted sPOW-sHash from the comparison as it has the same asymptotic costs as sPOW-pHash.

On the client-side sPOW and sPOW-pHash are far less demanding than bPOW from both the computational and the I/O perspective. This is a highly desirable characteristic,

Conclusions

We have presented a suite of novel security protocols to implement proof of ownership in a deduplication scenario. Our core scheme is provably secure and achieves better performance than the state-of-the-art solution in the most sensitive areas of client-side I/O and computation. Furthermore, it is resilient to a malicious client leaking large portions of the input file to third parties, whereas other schemes described in the literature will be compromised in case of leaks that are larger than

Acknowledgments

The authors would like to thank Geoff Kuenning and John Douceur for their help in accessing statistics on file sizes; Alexandra Shulman-Peleg and Christian Cachin for their insightful feedback; Charlotte Bolliger for her corrections.

Finally, the authors would like to thank the anonymous reviewers for their insightful comments that helped improve the quality of the manuscript.

References (30)

  • N. Agrawal et al.

    A five-year study of file-system metadata

    Trans. Storage

    (October 2007)
  • L. Aronovich et al.

    The design of a similarity based deduplication system

    Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference

    (2009)
  • G. Ateniese et al.

    Remote data checking using provable data possession

    ACM Trans. Inf. Syst. Secur.

    (2011)
  • G. Ateniese et al.

    Provable data possession at untrusted stores

    Proceedings of the 14th ACM Conference on Computer and Communications Security

    (2007)
  • G. Ateniese et al.

    Scalable and efficient provable data possession

    Proceedings of the 4th International Conference on Security and Privacy in Communication Networks

    (2008)
  • M. Bellare, S. Keelveedhi, T. Ristenpart, Message-locked encryption and secure deduplication, Adv. Cryptol. 2012,...
  • Bitcasa, Inc., Bitcasa Infinite Storage, 2013, (https://www.bitcasa.com/) (accessed...
  • J. Blasco et al.

    A tunable proof of ownership scheme for deduplication using Bloom filters

    (2014)
  • Conformal Systems, LLC., Cyphertite High Security Online Backup, 2013, (https://www.cyphertite.com/) (accessed June...
  • F. Lombardi et al.

    Security for cloud computing

    (2015)
  • R. Di Pietro et al.

    Boosting efficiency and security in proof of ownership for deduplication

    Proceedings of the ASIACCS

    (2012)
  • J.R. Douceur et al.

    Reclaiming space from duplicate files in a serverless distributed file system

    Proceedings of the ICDCS

    (2002)
  • M. Dutch, L. Freeman, Understanding data de-duplication ratios, 2008, (SNIA forum)....
  • C. Erway et al.

    Dynamic provable data possession

    Proceedings of the 16th ACM Conference on Computer and Communications Security

    (2009)
  • S. Halevi et al.

    Proofs of ownership in remote storage systems

    ACM Conference on Computer and Communications Security

    (2011)