A privacy-preserved full-text retrieval algorithm over encrypted data for cloud storage applications

https://doi.org/10.1016/j.jpdc.2016.05.017

Highlights

  • We identify the problem of secure full-text retrieval over the encrypted data.

  • A full-text retrieval algorithm based on the hierarchical Bloom filter tree index.

  • A privacy-preserved ranking algorithm based on the membership entropy of words.

  • The security and efficiency analysis of the proposed scheme.

Abstract

As cloud computing becomes prevalent, more and more sensitive information is being outsourced to the cloud. A straightforward way to protect data privacy is to encrypt the data before outsourcing. Recently, many searchable encryption schemes have been proposed to allow users to execute keyword-based search over encrypted data. However, it is difficult for users to find exactly the files they are interested in among huge amounts of data by relying solely on keyword-based search. In the information retrieval domain, full-text retrieval is a technology that allows efficient searches over massive amounts of web data. Unfortunately, full-text retrieval over encrypted cloud data has not been well studied. The full-text retrieval service requires extracting all the words in the contents of documents, and the resulting huge scale of index words cannot be efficiently supported by the existing searchable encryption schemes. Moreover, to protect user privacy, a privacy-preserved full-text retrieval index is required. These problems make efficient full-text retrieval over a large amount of encrypted cloud data a very challenging task. In this paper, we first establish a set of strict privacy requirements for full-text retrieval in cloud storage systems. To address the challenging problem, we design a Bloom filter based tree index. Our scheme fine-tunes the similarity between the query and encrypted documents by proposing the membership entropies of index words. Our security analysis shows that the scheme is provably secure. We demonstrate the effectiveness and efficiency of the proposed scheme through extensive experimental evaluation. The experimental results show that a search operation can be done in 60 milliseconds using an off-the-shelf moderate PC.

Introduction

Recently, cloud computing has emerged as a new platform for deploying, managing, and providing large scale data services through an Internet-based infrastructure. Successful examples include Amazon EC2, Google App Engine, and Microsoft Azure. By outsourcing data and services, cloud users enjoy a scalable, high-quality service in an economical and efficient manner. However, because the cloud is public, users usually prefer not to store their sensitive private information in the cloud in plaintext form, even if data privacy is enforced by law.

A straightforward solution for users to protect their data privacy is to encrypt the data before outsourcing. This model has been adopted by Amazon IS service [1]. Although privacy is preserved, data utilization, such as keyword-based search, becomes difficult over encrypted data. A naive approach requires the data to be downloaded and decrypted so that the search can run over plaintext. Such an approach generates tremendous communication and computation overheads and significantly undermines the advantages of using the cloud. Fully homomorphic encryption [14] allows arbitrary operations, including search operations, over encrypted data; however, its computation overhead is so high that it is far from practical.

Searchable encryption schemes [27], [20], [32], [13], [15], [10], [30], [9], [24], [12], [17], [8], [21], [26], [6], [29], [2], [7], [23], [18], [19], [16], [5], [28], [31], [25] have been developed in recent years to balance search efficiency and data security. However, the existing approaches mainly focus on keyword-based search, which struggles to meet the requirements of large scale cloud storage systems. Full-text retrieval is an information retrieval (IR) technology for content search over large scale data, and its effectiveness and efficiency have been verified by the success of popular Internet search engine systems. Providing privacy-preserved full-text retrieval services over encrypted data would greatly improve the utilization of cloud data. But existing full-text retrieval technologies do not take data security into account, so applying them directly in an untrusted cloud requires careful security design. Moreover, full-text retrieval needs to extract all the words in the contents of documents, which makes the scale of index words much larger than in keyword-based search. As a result, it is far from practical to provide a full-text retrieval service for cloud storage applications using the existing searchable encryption schemes.

In this paper, we formally define and solve the problem of full-text retrieval over encrypted cloud data in the cloud computing paradigm. Compared with keyword-based search, full-text retrieval differs significantly in the following respects:

  • The scale of index words. The first difference is the scale of the indexed words. A full-text retrieval scheme must index the entire document instead of only certain keywords, as the existing schemes do. This requirement makes achieving a compact index and high search efficiency the critical challenge.

  • The similarity measure. An efficient and precise similarity measure between the query and the document is the second demanding requirement for a full-text retrieval scheme. Due to the large scale of the indexed words, word frequency and position alone are not sufficient to measure the matching score between queries and the index, especially for a query that contains a compound word such as “cloud computing”. Although several ranking algorithms have been proposed for plaintext full-text retrieval systems, a solution for searchable encryption techniques is still lacking.

The Bloom filter [4] is a space-efficient randomized data structure that represents a set and supports membership queries. To support the large scale of index words required by full-text retrieval services, we use the Bloom filter to store the index words. In this work, we design a Bloom filter based tree index, on which we build a secure and efficient full-text retrieval algorithm. Before a user uploads a new document to the cloud, he encrypts the document and generates the Bloom filters that represent the document’s contents. After the cloud server receives the document and the corresponding Bloom filters, it inserts these Bloom filters into the index to complete the document insertion. To protect query privacy, the user encrypts his query keywords and generates an encrypted query request, which the cloud server then executes over the full-text retrieval index. In addition, to save valuable bandwidth, our scheme ranks the search results and returns only the documents most relevant to a user’s query. Compared with existing searchable encryption schemes that rank results [27], [20], our scheme incurs lower computational cost on the cloud server. We design an innovative similarity measure between the index words and the documents based on the membership entropies. The main contributions of this paper are summarized as follows:

  • To the best of our knowledge, this is the first work that identifies the important problem of secure full-text retrieval over encrypted data for a large-scale cloud storage system.

  • To address this problem, we propose a hierarchical Bloom filter tree index, based on which an efficient and secure full-text retrieval scheme over encrypted cloud data is proposed.

  • We design a ranking algorithm based on the membership entropy of index words, in which the part of the rank computation that involves sensitive information is performed by the data owner before uploading, so that ranked queries are both secure and efficient.

  • We thoroughly analyze the security and efficiency of the proposed scheme. We demonstrate the effectiveness and efficiency of our scheme through extensive experimental evaluation on a real dataset.
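
To make the idea of a Bloom filter based tree index more concrete, the following is a minimal sketch, not the construction used in the paper. It assumes a generic design in which each leaf holds the Bloom filter of one document and each internal node holds the bitwise OR of its children's filters, so a search can skip any subtree whose filter rules out a query word. The names `BloomFilter`, `Node`, `build_index`, and `search`, the SHA-256 based hashing, and the default values of `m` and `h` are all illustrative assumptions. Words are hashed in plaintext here; the encryption of documents and query words described above is not modeled.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class BloomFilter:
    """Fixed-size Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, m: int, h: int, bits: int = 0):
        self.m = m        # number of bit positions
        self.h = h        # number of hash functions
        self.bits = bits  # bit i is 1 if some inserted word hashed to i

    def _positions(self, word: str) -> List[int]:
        # Assumption: derive h positions by prefixing the word with an index.
        # A real scheme would use keyed hashes over encrypted words.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{word}".encode()).digest(), "big") % self.m
            for i in range(self.h)
        ]

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def might_contain(self, word: str) -> bool:
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(word))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        return BloomFilter(self.m, self.h, self.bits | other.bits)


@dataclass
class Node:
    bloom: BloomFilter
    doc_id: Optional[str] = None                    # set only on leaves
    children: List["Node"] = field(default_factory=list)


def build_index(docs: Dict[str, str], m: int = 1024, h: int = 3, fanout: int = 2) -> Node:
    """Build the tree bottom-up from a non-empty {doc_id: text} mapping."""
    level = []
    for doc_id, text in docs.items():
        bf = BloomFilter(m, h)
        for word in text.lower().split():
            bf.add(word)
        level.append(Node(bloom=bf, doc_id=doc_id))
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            merged = group[0].bloom
            for node in group[1:]:
                merged = merged.union(node.bloom)
            parents.append(Node(bloom=merged, children=group))
        level = parents
    return level[0]


def search(node: Node, words: List[str]) -> List[str]:
    """Return ids of documents whose filters may contain every query word."""
    if not all(node.bloom.might_contain(w) for w in words):
        return []                                   # prune this subtree
    if node.doc_id is not None:
        return [node.doc_id]
    hits: List[str] = []
    for child in node.children:
        hits.extend(search(child, words))
    return hits
```

Calling `search(build_index(docs), ["cloud"])` returns every document that contains the word, possibly together with a few false positives, which is why a larger `m` is preferable when documents contain many distinct words.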

The rest of the paper is organized as follows. In Section 2, we set up the system model and explain the design objectives. Section 3 details the proposed scheme, followed by Section 4, which introduces the ranking algorithm. Section 5 theoretically analyzes the security and the efficiency of our scheme. Performance evaluation results are reported in Section 6. In Section 7, we discuss the related work. Our paper is concluded in Section 8.

Section snippets

Full-text retrieval

The motivation of our paper is to provide a secure and efficient full-text retrieval service for large-scale collections of encrypted documents in the cloud. The ‘keyword-based search’ is a type of query based on ‘matching’, i.e., the documents that contain a word matching the query word are returned to the user as the result of the ‘keyword-based search’. In essence, ‘full-text retrieval’ is a type of query based on the ‘similarity’ between the documents and query words. The server

Overview of our scheme

We present the overview of the proposed privacy-preserved full-text retrieval framework in this subsection, which is shown in Fig. 3.

The processing flow of the full-text retrieval over the encrypted cloud data is as follows. When the cloud server begins to provide the storage services, it first sets up the global system parameters $(H, m, k, p)$ to initialize the cloud storage system (step a). $H$ represents a set of hash functions $(H_1, H_2, \ldots, H_h)$, $H_i: \{0,1\}^* \rightarrow [1, m]$ $(1 \le i \le h)$, which hash arbitrary strings to

Ranking results for full-text retrieval

Cloud storage is a large scale application with a huge number of documents, so it needs to rank the results and return the result documents in the order of their relevance to the query. The existing plaintext full-text retrieval models usually use word frequency and inverse document frequency to rank the query results. The idea of the ranking mechanism can be expressed as Eq. (3), where $\mathrm{Score}(d_i)$ represents the rank score of a document $d_i$ for a query $Q$, $N$ represents the total
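
As an illustration of ranking by word frequency and inverse document frequency, here is a minimal plaintext TF-IDF sketch. It is a textbook formulation and is not claimed to match Eq. (3) of the paper. The function names `tf_idf_scores` and `rank`, the normalization of term frequency by document length, and the use of `log(n / df)` with `n` as the number of documents are all assumptions made for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_scores(docs: Dict[str, List[str]], query: List[str]) -> Dict[str, float]:
    """Score each document against the query with a basic TF-IDF sum."""
    n = len(docs)  # number of documents in the collection (assumed meaning)
    counts = {doc_id: Counter(words) for doc_id, words in docs.items()}
    # Document frequency: in how many documents each query word appears.
    df = {w: sum(1 for c in counts.values() if w in c) for w in set(query)}
    scores: Dict[str, float] = {}
    for doc_id, c in counts.items():
        total = sum(c.values())
        score = 0.0
        for w in query:
            if df[w] == 0 or total == 0:
                continue  # word absent from the collection, or empty document
            tf = c[w] / total            # normalized term frequency
            idf = math.log(n / df[w])    # rarer words weigh more
            score += tf * idf
        scores[doc_id] = score
    return scores


def rank(docs: Dict[str, List[str]], query: List[str], top_k: int) -> List[str]:
    """Return the ids of the top_k highest-scoring documents."""
    scores = tf_idf_scores(docs, query)
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:top_k]
```

Returning only the `top_k` documents mirrors the bandwidth-saving goal of ranked retrieval. Note that this plaintext version needs the raw word counts, which is exactly the sensitive information a privacy-preserved ranking has to protect.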

Performance and security analysis

In this section, we theoretically analyze the correctness and the security of our scheme. Before giving the security analysis, we first present the correctness of the full-text retrieval in our scheme. Because the Bloom filter introduces false positives, we use recall and precision to measure the accuracy of the search result.
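
The effect of Bloom filter false positives on these two measures can be sketched as follows. The code uses the standard approximation of the false positive rate for a filter with m bit positions and h hash functions after n distinct items have been inserted, together with the usual definitions of precision and recall. The symbol n, the function names, and the convention of returning 1.0 for empty sets are assumptions of this example, not taken from the paper's analysis.

```python
import math
from typing import Set, Tuple


def bloom_false_positive_rate(m: int, h: int, n: int) -> float:
    """Standard approximation: (1 - e^(-h*n/m))^h."""
    return (1.0 - math.exp(-h * n / m)) ** h


def precision_recall(returned: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = hits / returned, recall = hits / relevant."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall
```

A standard Bloom filter never reports an inserted word as absent, so every truly matching document is returned and recall stays at 1, while false positives add non-matching documents and lower precision. For instance, `precision_recall({"d1", "d2", "d3"}, {"d1", "d2"})` gives a precision of 2/3 and a recall of 1.0.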

Experimental setup

We implement the proposed full-text retrieval scheme in Java. We carry out the experiments on a server running Windows 7 with a 64-bit, 2.9 GHz CPU and 4 GB of main memory. We run the experiments on a real world dataset: the Enron Email Dataset [11], which contains a total of 200,399 messages belonging to 158 users. In our experiments, the parameters of our scheme include m, h, k, and p; their values are shown in Table 2. In the experiments, we use the

Related work

This work mainly focuses on addressing the problem of secure and efficient full-text retrieval over encrypted data in the cloud. Existing research similar to ours can be found in the areas of keyword searchable encryption, rich functional encrypted data search, and ranked search over encrypted data.

Conclusion and future work

In this paper, we define and solve the problem of supporting efficient yet privacy-preserved full-text retrieval to enrich the query function over encrypted cloud data. We design a word-based hierarchical Bloom filter tree index structure to allow the authorized user to execute full-text retrieval over the encrypted documents in the cloud. Moreover, to efficiently utilize the valuable bandwidth resource, we design the ranking algorithm based on membership of index words to return only the most

Acknowledgments

This work is partially supported by National Natural Science Foundation of China No. 61202034, 61232002, CCF Opening Project of Chinese Information Processing No. CCF2014-01-02, and the Program for Innovative Research Team of Wuhan No. 2014070504020237.

Wei Song is an Assistant Professor at School of Computer, Wuhan University, China. He received his B.S. and Ph.D. degrees from Huazhong University of Science and Technology, China, in 2001 and 2008, respectively. His current research interests are in the areas of applied cryptography and network security, with focus on secure data service in cloud computing and encrypted data search.

References (32)

  • Amazon Web Service....
  • M. Bellare, A. Boldyreva, A. O’Neill, Deterministic and efficiently searchable encryption, in: Proceedings of CRYPTO,...
  • M. Bellare, R. Canetti, H. Krawczyk, Keying hash functions for message authentication, in: Proceedings of CRYPTO, 1996,...
  • B. Bloom, Space/time tradeoffs in hash coding with allowable errors, Commun. ACM (1970)
  • A. Boldyreva, N. Chenette, Y. Lee, A. O’Neill, Order-preserving symmetric encryption, in: Proceedings of EUROCRYPT,...
  • D. Boneh, G.D. Crescenzo, R. Ostrovsky, G. Persiano, Public key encryption with keyword search, in: Proceedings of...
  • D. Boneh, E. Kushilevitz, R. Ostrovsky, W.E. Skeith, Public key encryption that allows PIR queries, in: Proceedings of...
  • D. Boneh, B. Waters, Conjunctive, subset, and range queries on encrypted data, in: Proceedings of TCC, 2007, pp....
  • N. Cao, C. Wang, M. Li, K. Ren, W. Lou, Privacy-preserving multi-keyword ranked search over encrypted cloud data, in:...
  • Y.-C. Chang et al., Privacy preserving keyword searches on remote encrypted data
  • W.W. Cohen, Enron email dataset
  • R. Curtmola, J.A. Garay, S. Kamara, R. Ostrovsky, Searchable symmetric encryption: Improved definitions and efficient...
  • S. De Capitani di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, P. Samarati,...
  • C. Gentry, Fully homomorphic encryption using ideal lattices, in: Proceedings of STOC, 2009, pp....
  • E.-J. Goh, Secure indexes, IACR Cryptology ePrint Archive,...
  • L.L. Iacono, D. Torkian, A system-oriented approach to full-text search on encrypted cloud storage, in: Proceedings of...

Bing Wang received his B.S. and M.E. degrees in Computer Science from Fudan University and Shanghai Jiao-Tong University, respectively. He is currently working towards the Ph.D. degree in Computer Science at Virginia Tech. His research interests are in the areas of applied cryptography and network security, with current focus on secure data service in cloud computing and next generation Internet. He is a student member of the IEEE.

Qian Wang received the B.S. degree from Wuhan University, China, in 2003, the M.S. degree from Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, China, in 2006, and the Ph.D. degree from Illinois Institute of Technology, USA, in 2012, all in Electrical Engineering. He is currently a Professor with the School of Computer Science, Wuhan University. His research interests include wireless network security and privacy, cloud computing security, and applied cryptography. Qian is an expert under the “1000 Young Talents Program” of China. He is a co-recipient of the Best Paper Award from IEEE ICNP 2011. He is a Member of the IEEE and a Member of the ACM.

Zhiyong Peng received the B.S. and M.E. degrees in Computer Science from Wuhan University and Changsha Institute of Technology of China, respectively. He received the Ph.D. degree from Kyoto University of Japan in 1995. He is a professor at Wuhan University. Prior to joining Wuhan University in 2000, he worked as a researcher at the Advanced Software Technology and Mechatronics Research Institute of Kyoto from 1995 to 1997 and was a member of the technical staff at Hewlett-Packard Laboratories, Japan from 1997 to 2000. His current research interests are in databases, trusted data management, and complex data management.

Wenjing Lou is a Professor at Virginia Polytechnic Institute and State University. Prior to joining Virginia Tech in 2011, she was a faculty member at Worcester Polytechnic Institute from 2003 to 2011. She received her Ph.D. in Electrical and Computer Engineering at the University of Florida in 2003. Her current research interests are in cyber security, with emphases on wireless network security and data security and privacy in cloud computing. She was a recipient of the US National Science Foundation CAREER award in 2008.

Yihui Cui received his M.E. degree in Computer Science from Wuhan University. He is currently working towards the Ph.D. degree in Computer Science at Wuhan University. His research interests are in the areas of applied cryptography and network security, with current focus on data security in cloud computing.
