Abstract
Outsourcing data storage and computation to the cloud is appealing due to the cost savings it entails. However, when the data to be outsourced contain private information, appropriate protection mechanisms should be implemented by the data controller. Data splitting, which consists of fragmenting the data and storing them in separate clouds for the sake of privacy preservation, is an interesting alternative to encryption in terms of flexibility and efficiency. However, multivariate analyses on data split among various clouds are challenging, and they are even harder when data are nominal categorical (i.e., textual, non-ordinal), because the standard arithmetic operators cannot be used. In this article, we tackle the problem of outsourcing multivariate analyses on nominal data split over several honest-but-curious clouds. Specifically, we propose several secure protocols to outsource to multiple clouds the computation of a variety of multivariate analyses on nominal categorical data (frequency-based and semantic-based). Our protocols have been designed to outsource as much workload as possible to the clouds, in order to retain the cost-saving benefits of cloud computing while ensuring that the outsourced data stay split and hence privacy-protected against the clouds. The experiments we report on the Amazon cloud service show that by using our protocols the controller can save nearly all the runtime because it can integrate partial results received from the clouds with very little computation.
References
Aggarwal G, Bawa M, Ganesan P, Garcia-Molina H, Kenthapadi K, Motwani R, Srivastava U, Thomas D, Xu Y (2005) Two can keep a secret: a distributed architecture for secure database services. CIDR 2005:186–199
Agresti A, Kateri M (2011) Categorical data analysis. Springer, Berlin
Amazon EC2 Instance Types. https://aws.amazon.com/ec2/instance-types/?nc1=h_ls
Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I, Zaharia M (2010) A view of cloud computing. Commun ACM 53(4):50–58
Atallah MJ, Frikken KB (2010) Securely outsourcing linear algebra computations. In: 5th ACM symposium on information, computer and communications security—ASIACCS 2010, ACM, pp 48–59
Batet M, Harispe S, Ranwez S, Sánchez D, Ranwez V (2014) An information theoretic approach to improve semantic similarity assessments across multiple ontologies. Inf Sci 283:197–210
Batet M, Sánchez D (2015) A review on semantic similarity. In: Encyclopedia of information science and technology, 3rd edn. IGI Global, pp 7575–7583
California patient discharge data: California Office of Statewide Health Planning and Development (OSHPD), 2009. http://www.oshpd.ca.gov/HID/DataFlow/index.html
Calviño A, Ricci S, Domingo-Ferrer J (2015) Privacy-preserving distributed statistical computation to a semi-honest multi-cloud. In: IEEE conference on communications and network security (CNS 2015), IEEE, pp 506–514
Cimiano P (2006) Ontology learning and population from text: algorithms, evaluation and applications. Springer, Berlin
Ciriani V, De Capitani di Vimercati S, Foresti S, Jajodia S, Paraboschi S, Samarati P (2011) Selective data outsourcing for enforcing privacy. J Comput Secur 19(3):531–566
CLARUS—a Framework for user centred privacy and security in the cloud, H2020 project (2015–2017). http://www.clarussecure.eu
Clifton C, Kantarcioglu M, Vaidya J, Lin X, Zhu M (2002) Tools for privacy preserving distributed data mining. ACM SIGKDD Explor Newsl 4(2):28–34
Domingo-Ferrer J, Ricci S, Domingo-Enrich C (2018) Outsourcing scalar products and matrix products on privacy-protected unencrypted data stored in untrusted clouds. Inf Sci 436–437:320–342
Domingo-Ferrer J, Sánchez D, Rufian-Torrell G (2013) Anonymization of nominal data based on semantic marginality. Inf Sci 242:35–48
Domingo-Ferrer J, Torra V (2005) Ordinal, continuous and heterogeneous \(k\)-anonymity through microaggregation. Data Min Knowl Discov 11(2):195–212
Du W, Han Y, Chen S (2004) Privacy-preserving multivariate statistical analysis: linear regression and classification. In: SDM, vol 4. SIAM, pp 222–233
Dubovitskaya A, Urovi V, Vasirani M, Aberer K, Schumacher M (2015) A cloud-based eHealth architecture for privacy preserving data integration. In: ICT systems security and privacy protection, Springer, pp 585–598
Fu Z, Sun X, Ji S, Xie G (2016) Towards efficient content-aware search over encrypted outsourced data in cloud. In: Computer communications, IEEE INFOCOM 2016-the 35th annual IEEE international conference, IEEE, pp 1–9
General data protection regulation. European Union. http://www.gdpr-info.eu
Ghattas B, Michel P, Boyer L (2017) Clustering nominal data using unsupervised binary decision trees: comparisons with the state of the art methods. Pattern Recognit 67:177–185
Gelman A (2005) Analysis of variance—why it is more important than ever. Ann Stat 33(1):1–53
Goethals B, Laur S, Lipmaa H, Mielikäinen T (2005) On private scalar product computation for privacy-preserving data mining. In: Information security and cryptology – ICISC 2004, LNCS, vol 3506, Springer, pp 104–120
Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Schulte Nordholt E, Spicer K, De Wolf P-P (2006) Statistical disclosure control. Wiley, Hoboken
Karr A, Lin X, Sanil A, Reiter J (2009) Privacy-preserving analysis of vertically partitioned data using secure matrix products. J Off Stat 25(1):125–138
Lei X, Liao X, Huang T, Li H, Hu C (2013) Outsourcing large matrix inversion computation to a public cloud. IEEE Trans Cloud Comput 1(1):78–87
Lei X, Liao X, Huang T, Heriniaina F (2014) Achieving security, robust cheating resistance, and high-efficiency for outsourcing large matrix multiplication computation to a malicious cloud. Inf Sci 280:205–217
Li H, Yang Y, Luan TH, Liang X, Zhou L, Shen XS (2016) Enabling fine-grained multi-keyword search supporting classified sub-dictionaries over encrypted cloud data. IEEE Trans Dependable Secur Comput 13(3):312–325
Li L, Lu R, Choo KK, Datta A, Shao J (2016) Privacy-preserving-outsourced association rule mining on vertically partitioned databases. IEEE Trans Inf Forensics Secur 11(8):1847–1861
Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on machine learning, ICML 1998, pp 296–304
Nassar M, Erradi A, Sabry F, Malluhi QM (2014) Secure outsourcing of matrix operations as a service. In: IEEE CLOUD 2013, IEEE, pp 918–925
Paillier P (1999) Public-key cryptosystems based on composite degree residuosity classes. In: Advances in cryptology—EUROCRYPT ’99, LNCS, vol 1592, Springer, pp 223–238
Rada R, Mili H, Bicknell E, Blettner M (1989) Development and application of a metric on semantic nets. IEEE Trans Syst Man Cybern 19(1):17–30
Ren K, Wang C, Wang Q (2012) Security challenges for the public cloud. IEEE Internet Comput 16(1):69–73
Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th international joint conference on artificial intelligence, IJCAI, vol 1, pp 448–453
Ricci S, Domingo-Ferrer J, Sánchez D (2016) Privacy-preserving cloud-based statistical analyses on sensitive categorical data. In: Modeling decisions for artificial intelligence, Springer, pp 227–238
Rodríguez-García M, Batet M, Sánchez D (2017) A semantic framework for noise addition with nominal data. Knowl Based Syst 112:103–118
Samarati P (2001) Protecting respondents’ identities in microdata release. IEEE Trans Knowl Data Eng 13(6):1010–1027
Sánchez D, Batet M (2017) Privacy-preserving data outsourcing in the cloud via semantic data splitting. Comput Commun 110:187–201
Sánchez D, Batet M, Isern D, Valls A (2012) Ontology-based semantic similarity: a new feature-based approach. Expert Syst Appl 39(9):7718–7728
Sánchez D, Batet M, Isern D (2011) Ontology-based information content computation. Knowl Based Syst 24(2):297–303
Sánchez D, Batet M, Martínez S, Domingo-Ferrer J (2015) Semantic variance: an intuitive measure for ontology accuracy evaluation. Eng Appl Artif Intell 39:89–99
SNOMED-CT Ontology. https://en.wikipedia.org/wiki/SNOMED_CT
Sun Y, Yu Y, Li X, Zhang K, Qian H, Zhou Y (2016) Batch verifiable computation with public verifiability for outsourcing polynomials and matrix computations. In: Australasian conference on information security and privacy—ACISP 2016, Lecture Notes in Computer Science, vol 9722, Springer, pp 293–309
Székely GJ, Rizzo ML (2009) Brownian distance covariance. Ann Appl Stat 3(4):1236–1265
Taha A, Hadi AS (2016) Pair-wise association measures for categorical and mixed data. Inf Sci 346:73–89
Tugrul B, Polat H (2014) Privacy-preserving kriging interpolation on partitioned data. Knowl Based Syst 62:38–46
U.S. Federal Trade Commission: Data Brokers, A Call for Transparency and Accountability (2014)
Wang I-C, Shen C-H, Hsu T-S, Liao C-C, Wang DW, Zhan J (2009) Towards empirical aspects of secure scalar product. IEEE Trans Syst Man Cybern Part C 39(4):440–447
Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the annual meeting of the association for computational linguistics, pp 133–139
Xia Z, Wang X, Sun X, Wang Q (2016) A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data. IEEE Trans Parallel Distrib Syst 27(2):340–352
Yang JJ, Li JQ, Niu Y (2015) A hybrid solution for privacy preserving medical data sharing in the cloud environment. Future Gener Comput Syst 43:74–86
Zhang X, Boscardin WJ, Belin TR, Wan X, He Y, Zhang K (2015) A Bayesian method for analyzing combinations of continuous, ordinal, and nominal categorical data with missing values. J Multivar Anal 135:43–58
Acknowledgements
Partial support to this work has been received from the European Commission (projects H2020-700540 “CANVAS” and H2020-644024 “CLARUS”), from the Government of Catalonia (ICREA Acadèmia Prize to J. Domingo-Ferrer and grant 2017 SGR 705), and from the Spanish Government (projects RTI2018-095094-B-C21 “CONSENT” and TIN2016-80250-R “Sec-MCloud”). The authors are with the UNESCO Chair in Data Privacy, but the views in this paper are the authors’ own and are not necessarily shared by UNESCO.
Appendices
A Semantic distance calculation
The semantic distance quantifies the difference between the meanings of two nominal values. Semantic similarity/distance measures rely on the semantic evidence gathered from knowledge bases, such as ontologies, which taxonomically structure the concepts of a domain of knowledge [7]. Formally, an ontology \({\mathcal {O}}\) is composed, at least, of a set of concepts or classes C organized in a directed acyclic graph (due to multiple inheritance) by means of is-a (\(c_i < c_j\)) relationships [10], as shown in Fig. 2.
Measuring the semantic distance in large ontologies can be costly. In this section, we discuss the computational cost of some well-known measures by relying on the concepts introduced in the following definition.
Definition 1
Let \(S(\mathbf{X^a} )\) be the set of subsumers (i.e., taxonomic ancestors) of the nominal values of attribute \(\mathbf X^a\) mapped in an ontology \({\mathcal {O}}\). The least common subsumer of \(\mathbf X^a\), denoted by \(LCS(\mathbf{X^a})\), is the most specific concept in \(S(\mathbf{X^a})\). Formally,
The semantic distance is defined as a function \(d_s: {\mathcal {O}} \times {\mathcal {O}} \rightarrow {\mathbb {R}}\) mapping a pair of concepts (corresponding to nominal values) to a real number that quantifies the difference between their meanings. According to the calculation principle employed, ontology-based measures can be divided into three families:
1. Edge-counting measures.
2. Feature-based measures.
3. Measures based on information content.
A.1 Edge-counting measures
They estimate the semantic distance between concept pairs as a function of the length of the taxonomic path connecting the two concepts in the ontology [33].
A well-known edge-counting measure was proposed by Wu and Palmer [50]:
where \({\mathrm{denominator}} = 2\times \text {depth}(LCS( c_1 , c_2 )) + \text {path}(c_1, LCS( c_1 , c_2 )) + \text {path}(c_2, LCS( c_1 , c_2 ))\); \(LCS(c_1 ,c_2)\) is the most specific subsumer of \(c_1\) and \(c_2\) in the ontology; \(\text {depth}(LCS( c_1 , c_2 ))\) is the number of nodes in the longest taxonomic path between \(LCS(c_1 ,c_2 )\) and the root node of the taxonomy; and \(\text {path}(c_i, LCS( c_1, c_2 ))\) is the number of taxonomic edges in the shortest taxonomic path between \(c_i\) and \(LCS(c_1, c_2)\).
Simplicity is the main advantage of edge-counting measures. However, they present some shortcomings: (1) if they are applied to ontologies incorporating multiple taxonomical inheritance, several taxonomical paths are not taken into account, and (2) by considering only the paths (i.e., subsumers) between the concepts, much of the taxonomical knowledge explicitly modeled in the ontology is ignored.
Assuming that concepts in the ontology are linked with their ancestors through pointers, in the worst case (comparing the two most specific concepts in the ontology that have the root node as LCS), obtaining the \(LCS(c_1 ,c_2)\) requires running through the longest path in the taxonomy, i.e., twice the taxonomy depth D. Therefore, it takes O(D) cost to compute Expression (10).
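To make the edge-counting computation concrete, the following minimal Python sketch evaluates a Wu and Palmer style distance on a toy taxonomy. It assumes that the distance is one minus the ratio \(2\times \text{depth}(LCS(c_1,c_2))/\mathrm{denominator}\), with the denominator as defined above; the taxonomy `PARENTS` and all function names are hypothetical and serve only as an illustration. For brevity, the LCS is found by intersecting ancestor sets rather than by the pointer walk discussed above, so the sketch does not reproduce the O(D) cost.

```python
# Hypothetical toy taxonomy: concept -> list of direct parents (is-a edges).
PARENTS = {
    "entity": [],
    "disease": ["entity"],
    "infection": ["disease"],
    "flu": ["infection"],
    "cold": ["infection"],
    "fracture": ["entity"],
}


def subsumers(c):
    """Return c together with all its taxonomic ancestors."""
    seen, stack = set(), [c]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def depth(c):
    """Number of nodes on the longest is-a path from c up to the root."""
    return 1 + max((depth(p) for p in PARENTS[c]), default=0)


def path_len(c, ancestor):
    """Number of edges on the shortest is-a path from c up to ancestor."""
    frontier, dist = {c}, 0
    while ancestor not in frontier:
        frontier = {p for node in frontier for p in PARENTS[node]}
        if not frontier:
            raise ValueError(f"{ancestor!r} does not subsume {c!r}")
        dist += 1
    return dist


def lcs(c1, c2):
    """Deepest (most specific) concept subsuming both c1 and c2."""
    common = subsumers(c1) & subsumers(c2)
    return max(sorted(common), key=depth)


def wu_palmer_distance(c1, c2):
    """Assumed form: 1 - 2*depth(LCS) / denominator."""
    a = lcs(c1, c2)
    numerator = 2 * depth(a)
    denominator = numerator + path_len(c1, a) + path_len(c2, a)
    return 1 - numerator / denominator


print(wu_palmer_distance("flu", "cold"))      # siblings under "infection"
print(wu_palmer_distance("flu", "fracture"))  # only the root in common
```

In this toy taxonomy, sibling concepts sharing a deep LCS obtain a small distance, whereas concepts whose only common subsumer is the root obtain a larger one.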
A.2 Feature-based measures
They consider the degree of overlap between the sets of ontological features of the concepts to be compared. In [40], the authors suggested measuring the semantic distance as a function of taxonomic features, i.e., as the ratio between the number of non-common taxonomic ancestors and the total number of ancestors of the two concepts:
where \(S(c_i)\) is the set of taxonomic subsumers of the concept \(c_i\), for \(i = 1,2\). Due to the additional knowledge feature-based measures take into account (i.e., multiple direct ancestors in case of multiple inheritance), they tend to be more accurate than edge-counting measures [40].
If S is the maximum number of ancestors that a concept can have in the ontology, computing Expression (11) takes O(S) cost. Notice that, for ontologies without multiple inheritance, this cost is the same as the one of edge-counting measures.
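As an illustration of the ratio described above, the following minimal Python sketch computes a feature-based distance on a toy taxonomy with multiple inheritance. It assumes that \(S(c)\) includes the concept \(c\) itself and that the distance is the plain ratio between non-common and total ancestors; any further transformation applied in [40] is deliberately left out. The taxonomy and function names are hypothetical.

```python
# Hypothetical toy taxonomy with multiple inheritance:
# concept -> list of direct parents (is-a edges).
PARENTS = {
    "entity": [],
    "disease": ["entity"],
    "respiratory_disease": ["disease"],
    "infection": ["disease"],
    "flu": ["infection", "respiratory_disease"],
    "asthma": ["respiratory_disease"],
}


def subsumers(c):
    """Return S(c): c itself (an assumption) plus all its ancestors."""
    seen, stack = set(), [c]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def feature_distance(c1, c2):
    """Ratio of non-common ancestors to all ancestors of c1 and c2."""
    s1, s2 = subsumers(c1), subsumers(c2)
    non_common = s1 ^ s2   # symmetric difference of the two ancestor sets
    total = s1 | s2        # union of the two ancestor sets
    return len(non_common) / len(total)


print(feature_distance("flu", "asthma"))
```

Because both direct parents of `flu` contribute to \(S(c)\), the ancestor reached through `respiratory_disease` counts as shared with `asthma`, which an edge-counting measure following a single path could miss.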
A.3 Measures based on information content
They measure the semantic distance between two concepts as the inverse of the amount of information they share in the ontology, which is represented by their LCS [35]. In particular, Lin [30] proposed as a measure the inverse of the ratio between the information content of the LCS of the concepts and the sum of the information content of each concept.
In [41], IC(c) is intrinsically estimated within the ontology as the normalized ratio between the number of leaves (i.e., terminal hyponyms) under concept c in the taxonomy and the number of subsumers of c:
Because IC-based measures exploit the largest amount of ontological evidence (i.e., ancestors and leaves), they achieve better accuracy than edge-counting and feature-based measures [6].
Expression (12) requires computing the LCS of the two concepts, plus the ICs of the LCS and the concepts. As in edge-counting measures, computing the LCS has a worst-case complexity O(D). On the other hand, Expression (13) requires obtaining all the possible concepts connected to c, either subsumers or hyponyms; hence, in the worst case (i.e., when c is the root node, which subsumes all the concepts in the ontology), the IC computation takes O(C) cost, where C is the total number of concepts in the taxonomy. In conclusion, Expression (12) has \(O(C+D)\) computational cost. Thus, IC-based measures are not only the most accurate but also the costliest.
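The following minimal Python sketch illustrates an IC-based distance on a toy taxonomy. Two forms are assumed here rather than taken from the text: the intrinsic IC is computed as \(-\log\big((|\text{leaves}(c)|/|S(c)| + 1)/(\text{max\_leaves} + 1)\big)\), and the Lin-style distance as \(1 - 2\,IC(LCS(c_1,c_2))/(IC(c_1)+IC(c_2))\), with the LCS taken as the common subsumer of maximum IC. The taxonomy and all names are hypothetical.

```python
import math

# Hypothetical toy taxonomy: concept -> list of direct parents (is-a edges).
PARENTS = {
    "entity": [],
    "disease": ["entity"],
    "infection": ["disease"],
    "flu": ["infection"],
    "cold": ["infection"],
    "fracture": ["entity"],
}

# Leaves are the concepts that are nobody's parent.
LEAVES = {c for c in PARENTS if all(c not in ps for ps in PARENTS.values())}


def subsumers(c):
    """Return c together with all its taxonomic ancestors."""
    seen, stack = set(), [c]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def leaves_under(c):
    """Leaves that are proper hyponyms of c (empty if c is a leaf)."""
    return {leaf for leaf in LEAVES if leaf != c and c in subsumers(leaf)}


def ic(c):
    """Assumed intrinsic IC: 0 at the root, maximal at the leaves."""
    ratio = len(leaves_under(c)) / len(subsumers(c))
    return -math.log((ratio + 1) / (len(LEAVES) + 1))


def lin_distance(c1, c2):
    """Assumed form: 1 - 2*IC(LCS) / (IC(c1) + IC(c2))."""
    common = subsumers(c1) & subsumers(c2)
    lcs_ic = max(ic(c) for c in common)  # IC of the most informative subsumer
    total = ic(c1) + ic(c2)
    return 0.0 if total == 0 else 1 - 2 * lcs_ic / total


print(lin_distance("flu", "cold"))      # share the informative "infection"
print(lin_distance("flu", "fracture"))  # share only the root, whose IC is 0
```

Note that `ic` scans every leaf and every ancestor set, which mirrors the O(C) worst-case cost discussed above; an implementation for a large ontology would presumably cache the IC values.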
B Security of the scalar product protocols used
B.1 Proof of Proposition 1
Charlie receives \(\varvec{r}'_x\) from Alice. But \(\varvec{r}'_x\) can be obtained as the difference between \(\hat{\varvec{x}}'+{\varvec{k}}\) and \({\varvec{x}}+{\varvec{k}}\), where \({\varvec{k}}\) is an n-vector with all its components set to k and k is any real number. Hence, Charlie learns nothing about \(\varvec{x}\). A similar argument shows that Charlie learns nothing about \(\varvec{y}\).
Bob receives \(\hat{\varvec{x}}'\) from Alice and \({\varvec{r}}_y\) from Charlie. Clearly, \({\varvec{r}}_y\) contains no information on \({\varvec{x}}\). On the other hand,
Since \(\mathbf{P}_x\) is a random permutation, the probability of Bob’s learning \(\hat{\varvec{x}}\) from \(\hat{\varvec{x}}'\) is 1 over the number of distinct permutations of \(\hat{\varvec{x}}\), that is, \(\left( n^x_1! \, n^x_2! \cdots n^x_{d_x}! \right) / n!\),
where \(d_x\) is the number of different values among the n values of \(\hat{\varvec{x}}\), and \(n^x_i\) is the number of repetitions of the ith different value. Since \(\hat{\varvec{x}}\) is the result of adding a random vector to \({\varvec{x}}\), it is highly unlikely that \(\hat{\varvec{x}}\) contains repeated values, so the probability of Bob’s learning \(\hat{\varvec{x}}\) is very low. Furthermore, Bob does not know \({\varvec{r}}_x\). Without knowledge of \(\hat{\varvec{x}}\) and \({\varvec{r}}_x\), Bob cannot learn \(\varvec{x}\).
The argument on the inability of Alice to learn \(\varvec{y}\) is analogous.
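The counting argument in this proof can be checked numerically. The sketch below computes one over the number of distinct arrangements of a vector, i.e., the chance of guessing the original order of a permuted vector when every arrangement is equally likely. The function name is hypothetical, and exact rational arithmetic is used to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from math import factorial, prod


def guess_probability(values):
    """1 / (number of distinct permutations of the multiset `values`)."""
    n = len(values)
    repetitions = Counter(values).values()  # n_i for each distinct value
    return Fraction(prod(factorial(m) for m in repetitions), factorial(n))


print(guess_probability([0.3, 1.7, 2.2]))  # all distinct: 1/3!
print(guess_probability([5, 5, 7]))        # one repeated value: 2!/3!
print(guess_probability(range(10)))        # 10 distinct values: 1/10!
```

With no repeated values the probability decays as \(1/n!\), which is why masking with a random vector before permuting makes a correct guess very unlikely even for moderate n.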
B.2 On the security of Protocol 2
Protocol 2 is a variation of a protocol proposed in [23]. The latter protocol takes place only between Alice and Bob and there is no CLARUS proxy. Thus it differs from Protocol 2 in the last three steps, which are as follows:
4. Bob generates a random plaintext \(s_B\), a random number \(r'\) and sends \(\omega ' = \omega Enc_{p_k}(-s_B;r')\) to Alice.
5. Alice computes \(s_A = Dec_{s_k}(\omega ') = \varvec{x}^T \varvec{y} - s_B\).
6. Alice and Bob simultaneously exchange the values \(s_A\) and \(s_B\), respectively, so that both can compute \(s_A + s_B = \varvec{x}^T \varvec{y}\).
The authors of [23] prove that, if Paillier’s cryptosystem is secure, Alice cannot learn \({\varvec{y}}\) and Bob cannot learn \(\mathbf{x}\) in their protocol.
The only modification introduced by Protocol 2 is that Alice and Bob do not share their results \(s_A\) and \(s_B\), but they send these values to CLARUS. Since neither Alice nor Bob has more information than in the protocol of [23], the security of the latter protocol is preserved in Protocol 2.
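The message flow discussed in this section can be sketched with a toy Paillier implementation in Python. The sketch is only illustrative: the primes are far too small to be secure, all three parties run inside one function, and the first steps (Alice encrypting her components and Bob combining them homomorphically into \(\omega\)) are an assumption based on the usual shape of Paillier-based scalar product protocols, since they are not reproduced above. All function names are hypothetical, and the result is correct only when \(0 \le \varvec{x}^T \varvec{y} < n\).

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small (insecure) primes, with g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam; valid because g = n + 1
    return n, (lam, mu)


def rand_unit(n):
    """Random element of [1, n) that is coprime to n."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            return r


def enc(n, m, r=None):
    """Enc(m; r) = (1 + n)^m * r^n mod n^2, using (1 + n)^m = 1 + m*n mod n^2."""
    r = rand_unit(n) if r is None else r
    n2 = n * n
    return (1 + (m % n) * n) % n2 * pow(r, n, n2) % n2


def dec(n, sk, c):
    """Dec(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) // n."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def scalar_product_via_proxy(x, y):
    n, sk = keygen()                          # Alice: key generation
    n2 = n * n
    ciphertexts = [enc(n, xi) for xi in x]    # Alice -> Bob (assumed step)
    omega = 1
    for c, yi in zip(ciphertexts, y):         # Bob: omega encrypts x^T y
        omega = omega * pow(c, yi % n, n2) % n2
    s_b = random.randrange(n)                 # Step 4: Bob masks with -s_B
    omega_prime = omega * enc(n, -s_b) % n2
    s_a = dec(n, sk, omega_prime)             # Step 5: Alice gets x^T y - s_B
    return (s_a + s_b) % n                    # Proxy adds the two shares


print(scalar_product_via_proxy([1, 2, 3], [4, 5, 6]))
```

Neither share reveals the scalar product on its own, because \(s_B\) is uniformly random modulo n; only the party that receives both shares can recover the result.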
Cite this article
Domingo-Ferrer, J., Sánchez, D., Ricci, S. et al. Outsourcing analyses on privacy-protected multivariate categorical data stored in untrusted clouds. Knowl Inf Syst 62, 2301–2326 (2020). https://doi.org/10.1007/s10115-019-01424-4
DOI: https://doi.org/10.1007/s10115-019-01424-4