Outsourcing analyses on privacy-protected multivariate categorical data stored in untrusted clouds

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Outsourcing data storage and computation to the cloud is appealing due to the cost savings it entails. However, when the data to be outsourced contain private information, appropriate protection mechanisms should be implemented by the data controller. Data splitting, which consists of fragmenting the data and storing them in separate clouds for the sake of privacy preservation, is an interesting alternative to encryption in terms of flexibility and efficiency. However, multivariate analyses on data split among various clouds are challenging, and they are even harder when data are nominal categorical (i.e., textual, non-ordinal), because the standard arithmetic operators cannot be used. In this article, we tackle the problem of outsourcing multivariate analyses on nominal data split over several honest-but-curious clouds. Specifically, we propose several secure protocols to outsource to multiple clouds the computation of a variety of multivariate analyses on nominal categorical data (frequency-based and semantic-based). Our protocols have been designed to outsource as much workload as possible to the clouds, in order to retain the cost-saving benefits of cloud computing while ensuring that the outsourced data stay split and hence privacy-protected against the clouds. The experiments we report on the Amazon cloud service show that by using our protocols the controller can save nearly all the runtime because it can integrate partial results received from the clouds with very little computation.

References

  1. Aggarwal G, Bawa M, Ganesan P, Garcia-Molina H, Kenthapadi K, Motwani R, Srivastava U, Thomas D, Xu Y (2005) Two can keep a secret: a distributed architecture for secure database services. CIDR 2005:186–199

  2. Agresti A, Kateri M (2011) Categorical data analysis. Springer, Berlin

  3. Amazon EC2 Instance Types. https://aws.amazon.com/ec2/instance-types/?nc1=h_ls

  4. Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I, Zaharia M (2010) A view of cloud computing. Commun ACM 53(4):50–58

  5. Atallah MJ, Frikken KB (2010) Securely outsourcing linear algebra computations. In: 5th ACM symposium on information, computer and communications security—ASIACCS 2010, ACM, pp 48–59

  6. Batet M, Harispe S, Ranwez S, Sánchez D, Ranwez V (2014) An information theoretic approach to improve semantic similarity assessments across multiple ontologies. Inf Sci 283:197–210

  7. Batet M, Sánchez D (2015) A review on semantic similarity. In: Encyclopedia of information science and technology, 3rd edn. IGI Global, pp 7575–7583

  8. California patient discharge data: California Office of Statewide Health Planning and Development (OSHPD), 2009. http://www.oshpd.ca.gov/HID/DataFlow/index.html

  9. Calviño A, Ricci S, Domingo-Ferrer J (2015) Privacy-preserving distributed statistical computation to a semi-honest multi-cloud. In: IEEE conference on communications and network security (CNS 2015), IEEE, pp 506–514

  10. Cimiano P (2006) Ontology learning and population from text: algorithms, evaluation and applications. Springer, Berlin

  11. Ciriani V, De Capitani di Vimercati S, Foresti S, Jajodia S, Paraboschi S, Samarati P (2011) Selective data outsourcing for enforcing privacy. J Comput Secur 19(3):531–566

  12. CLARUS—a Framework for user centred privacy and security in the cloud, H2020 project (2015–2017). http://www.clarussecure.eu

  13. Clifton C, Kantarcioglu M, Vaidya J, Lin X, Zhu M (2002) Tools for privacy preserving distributed data mining. ACM SIGKDD Explor Newsl 4(2):28–34

  14. Domingo-Ferrer J, Ricci S, Domingo-Enrich C (2018) Outsourcing scalar products and matrix products on privacy-protected unencrypted data stored in untrusted clouds. Inf Sci 436–437:320–342

  15. Domingo-Ferrer J, Sánchez D, Rufian-Torrell G (2013) Anonymization of nominal data based on semantic marginality. Inf Sci 242:35–48

  16. Domingo-Ferrer J, Torra V (2005) Ordinal, continuous and heterogeneous \(k\)-anonymity through microaggregation. Data Min Knowl Discov 11(2):195–212

  17. Du W, Han Y, Chen S (2004) Privacy-preserving multivariate statistical analysis: linear regression and classification. In: SDM, vol 4. SIAM, pp 222–233

  18. Dubovitskaya A, Urovi V, Vasirani M, Aberer K, Schumacher M (2015) A cloud-based eHealth architecture for privacy preserving data integration. In: ICT systems security and privacy protection, Springer, pp 585–598

  19. Fu Z, Sun X, Ji S, Xie G (2016) Towards efficient content-aware search over encrypted outsourced data in cloud. In: Computer communications, IEEE INFOCOM 2016-the 35th annual IEEE international conference, IEEE, pp 1–9

  20. General data protection regulation. European Union. http://www.gdpr-info.eu

  21. Ghattas B, Michel P, Boyer L (2017) Clustering nominal data using unsupervised binary decision trees: comparisons with the state of the art methods. Pattern Recognit 67:177–85

  22. Gelman A (2005) Analysis of variance—why it is more important than ever. Ann Stat 33(1):1–53

  23. Goethals B, Laur S, Lipmaa H, Mielikäinen T (2005) On private scalar product computation for privacy-preserving data mining. In: Information security and cryptology—ICISC 2004, LNCS, vol 3506, Springer, pp 104–120

  24. Hundepool A, Domingo-Ferrer J, Franconi L, Giessing S, Schulte Nordholt E, Spicer K, De Wolf P-P (2006) Statistical disclosure control. Wiley, Hoboken

  25. Karr A, Lin X, Sanil A, Reiter J (2009) Privacy-preserving analysis of vertically partitioned data using secure matrix products. J Off Stat 25(1):125–138

  26. Lei X, Liao X, Huang T, Li H, Hu C (2013) Outsourcing large matrix inversion computation to a public cloud. IEEE Trans Cloud Comput 1(1):78–87

  27. Lei X, Liao X, Huang T, Heriniaina F (2014) Achieving security, robust cheating resistance, and high-efficiency for outsourcing large matrix multiplication computation to a malicious cloud. Inf Sci 280:205–217

  28. Li H, Yang Y, Luan TH, Liang X, Zhou L, Shen XS (2016) Enabling fine-grained multi-keyword search supporting classified sub-dictionaries over encrypted cloud data. IEEE Trans Dependable Secur Comput 13(3):312–25

  29. Li L, Lu R, Choo KK, Datta A, Shao J (2016) Privacy-preserving-outsourced association rule mining on vertically partitioned databases. IEEE Trans Inf Forensics Secur 11(8):1847–61

  30. Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on machine learning, ICML 1998, pp 296–304

  31. Nassar M, Erradi A, Sabry F, Malluhi QM (2014) Secure outsourcing of matrix operations as a service. In: IEEE CLOUD 2013, IEEE, pp 918–925

  32. Paillier P (1999) Public-key cryptosystems based on composite degree residuosity classes. In: Advances in cryptology—EUROCRYPT ’99, LNCS, vol 1592, Springer, pp 223–238

  33. Rada R, Mili H, Bicknell E, Blettner M (1989) Development and application of a metric on semantic nets. IEEE Trans Syst Man Cybern 19(1):17–30

  34. Ren K, Wang C, Wang Q (2012) Security challenges for the public cloud. IEEE Internet Comput 16(1):69–73

  35. Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th international joint conference on artificial intelligence, IJCAI, vol 1, pp 448–453

  36. Ricci S, Domingo-Ferrer J, Sánchez D (2016) Privacy-preserving cloud-based statistical analyses on sensitive categorical data. In: Modeling decisions for artificial intelligence, Springer, pp 227–238

  37. Rodríguez-García M, Batet M, Sánchez D (2017) A semantic framework for noise addition with nominal data. Knowl Based Syst 112:103–118

  38. Samarati P (2001) Protecting respondents’ identities in microdata release. IEEE Trans Knowl Data Eng 13(6):1010–1027

  39. Sánchez D, Batet M (2017) Privacy-preserving data outsourcing in the cloud via semantic data splitting. Comput Commun 110:187–201

  40. Sánchez D, Batet M, Isern D, Valls A (2012) Ontology-based semantic similarity: a new feature-based approach. Expert Syst Appl 39(9):7718–7728

  41. Sánchez D, Batet M, Isern D (2011) Ontology-based information content computation. Knowl Based Syst 24(2):297–303

  42. Sánchez D, Batet M, Martínez S, Domingo-Ferrer J (2015) Semantic variance: an intuitive measure for ontology accuracy evaluation. Eng Appl Artif Intell 39:89–99

  43. SNOMED-CT Ontology. https://en.wikipedia.org/wiki/SNOMED_CT

  44. Sun Y, Yu Y, Li X, Zhang K, Qian H, Zhou Y (2016) Batch verifiable computation with public verifiability for outsourcing polynomials and matrix computations. In: Australasian conference on information security and privacy—ACISP 2016, Lecture Notes in Computer Science, vol 9722, Springer, pp 293–309

  45. Székely GJ, Rizzo ML (2009) Brownian distance covariance. Ann Appl Stat 3(4):1236–1265

  46. Taha A, Hadi AS (2016) Pair-wise association measures for categorical and mixed data. Inf Sci 346:73–89

  47. Tugrul B, Polat H (2014) Privacy-preserving kriging interpolation on partitioned data. Knowl Based Syst 62:38–46

  48. U.S. Federal Trade Commission: Data Brokers, A Call for Transparency and Accountability (2014)

  49. Wang I-C, Shen C-H, Hsu T-S, Liao C-C, Wang DW, Zhan J (2009) Towards empirical aspects of secure scalar product. IEEE Trans Syst Man Cybern Part C 39(4):440–447

  50. Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the annual meeting of the association for computational linguistics, pp 133–139

  51. Xia Z, Wang X, Sun X, Wang Q (2016) A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data. IEEE Trans Parallel Distrib Syst 27(2):340–52

  52. Yang JJ, Li JQ, Niu Y (2015) A hybrid solution for privacy preserving medical data sharing in the cloud environment. Future Gener Comput Syst 43:74–86

  53. Zhang X, Boscardin WJ, Belin TR, Wan X, He Y, Zhang K (2015) A Bayesian method for analyzing combinations of continuous, ordinal, and nominal categorical data with missing values. J Multivar Anal 135:43–58

Acknowledgements

Partial support to this work has been received from the European Commission (projects H2020-700540 “CANVAS” and H2020-644024 “CLARUS”), from the Government of Catalonia (ICREA Acadèmia Prize to J. Domingo-Ferrer and grant 2017 SGR 705), and from the Spanish Government (projects RTI2018-095094-B-C21 “CONSENT” and TIN2016-80250-R “Sec-MCloud”). The authors are with the UNESCO Chair in Data Privacy, but the views in this paper are the authors’ own and are not necessarily shared by UNESCO.

Author information

Corresponding author

Correspondence to Josep Domingo-Ferrer.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Semantic distance calculation

The semantic distance quantifies the difference between the meanings of two nominal values. Semantic similarity/distance measures rely on the semantic evidence gathered from knowledge bases, such as ontologies, which taxonomically structure the concepts of a domain of knowledge [7]. Formally, an ontology \({\mathcal {O}}\) is composed, at least, of a set of concepts or classes C organized in a directed acyclic graph (due to multiple inheritance) by means of is-a (\(c_i < c_j\)) relationships [10], as shown in Fig. 2.

Fig. 2 Ontology extract for the “Diagnosis” concept

Measuring the semantic distance in large ontologies can be costly. In this section, we discuss the computational cost of some well-known measures by relying on the concepts introduced in the following definition.

Definition 1

Let \(S(\mathbf{X^a} )\) be the set of subsumers (i.e., taxonomic ancestors) of the nominal values of attribute \(\mathbf X^a\) mapped in an ontology \({\mathcal {O}}\). The least common subsumer of \(\mathbf X^a\), denoted by \(LCS(\mathbf{X^a})\), is the most specific concept in \(S(\mathbf{X^a})\). Formally,

$$\begin{aligned} S(\mathbf{X^a}) &= \{ c_i \in {\mathcal {O}} \mid \forall c_j \in \mathbf{X^a} : c_j \le c_i\}; \\ LCS(\mathbf{X^a}) &= \{ c \in S(\mathbf{X^a}) \mid \forall c_i \in S(\mathbf{X^a}) : c \le c_i\}. \end{aligned}$$
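Definition 1 can be illustrated with a minimal Python sketch. The toy taxonomy below (the `PARENTS` map and its concept names) is hypothetical and is not the ontology of Fig. 2; the sketch also assumes that \(c \le c\) holds, so that a concept counts among its own subsumers.

```python
# Hypothetical toy taxonomy: each concept maps to its direct is-a parents.
PARENTS = {
    "diagnosis": [],
    "infection": ["diagnosis"],
    "respiratory": ["diagnosis"],
    "flu": ["infection", "respiratory"],  # multiple inheritance
    "asthma": ["respiratory"],
}

def ancestors(c):
    """Return c together with all its taxonomic ancestors."""
    seen, stack = {c}, [c]
    while stack:
        for p in PARENTS[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def common_subsumers(values):
    """S(X^a): concepts that subsume every value of the attribute."""
    return set.intersection(*(ancestors(v) for v in values))

def lcs(values):
    """LCS(X^a): members of S(X^a) that lie below every member of S(X^a)."""
    s = common_subsumers(values)
    return {c for c in s if s <= ancestors(c)}

print(common_subsumers(["flu", "asthma"]))
print(lcs(["flu", "asthma"]))
```

Here the attribute values "flu" and "asthma" share the subsumers "respiratory" and "diagnosis", and the most specific of them is "respiratory".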

The semantic distance is defined as a function \(d_s: {\mathcal {O}} \times {\mathcal {O}} \rightarrow {\mathbb {R}}\) mapping a pair of concepts (corresponding to nominal values) to a real number that quantifies the difference between their meanings. According to the calculation principle employed, ontology-based measures can be divided in three families:

  1. Edge-counting measures.

  2. Feature-based measures.

  3. Measures based on information content.

A.1 Edge-counting measures

Edge-counting measures estimate the semantic distance between concept pairs as a function of the length of the taxonomic path connecting the two concepts in the ontology [33].

A well-known edge-counting measure was proposed by Wu and Palmer [50]:

$$\begin{aligned} d_{\text {WP}} (c_1 , c_2) = 1 - \frac{2\times \text {depth}(LCS( c_1 , c_2 ))}{\mathrm{denominator}}, \end{aligned}$$
(10)

where \({\mathrm{denominator}} = 2\times \text {depth}(LCS( c_1 , c_2 )) + \text {path}(c_1, LCS( c_1 , c_2 )) + \text {path}(c_2, LCS( c_1 , c_2 ))\); \(LCS(c_1 ,c_2)\) is the most specific subsumer of \(c_1\) and \(c_2\) in the ontology; \(\text {depth}(LCS( c_1 , c_2 ))\) is the number of nodes in the longest taxonomic path between \(LCS(c_1 ,c_2 )\) and the root node of the taxonomy; and \(\text {path}(c_i, LCS( c_1, c_2 ))\) is the number of taxonomic edges in the shortest taxonomic path between \(c_i\) and \(LCS(c_1, c_2)\).
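As a rough illustration of Expression (10), the following Python sketch evaluates the Wu and Palmer distance on a hypothetical toy taxonomy (the `PARENTS` map is an assumption, not taken from the paper). The LCS is taken to be the deepest common subsumer, which is one plausible reading of "most specific".

```python
# Hypothetical toy taxonomy: each concept maps to its direct is-a parents.
PARENTS = {
    "diagnosis": [],
    "infection": ["diagnosis"],
    "respiratory": ["diagnosis"],
    "flu": ["infection", "respiratory"],
    "asthma": ["respiratory"],
}

def depth(c):
    """Number of nodes on the longest path from c up to the root."""
    return 1 + max((depth(p) for p in PARENTS[c]), default=0)

def up_distances(c):
    """Fewest is-a edges from c to each of its subsumers (breadth-first)."""
    dist, frontier = {c: 0}, [c]
    while frontier:
        nxt = []
        for node in frontier:
            for p in PARENTS[node]:
                if p not in dist:
                    dist[p] = dist[node] + 1
                    nxt.append(p)
        frontier = nxt
    return dist

def wu_palmer_distance(c1, c2):
    d1, d2 = up_distances(c1), up_distances(c2)
    # Assumption: the LCS is the deepest concept that subsumes both.
    lcs = max(d1.keys() & d2.keys(), key=depth)
    num = 2 * depth(lcs)
    return 1 - num / (num + d1[lcs] + d2[lcs])

print(wu_palmer_distance("flu", "asthma"))
```

For "flu" and "asthma" the LCS is "respiratory" at depth 2, and each concept is one edge away from it, so the distance is 1 - 4/6.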

Simplicity is the main advantage of edge-counting measures. However, they present some shortcomings: (1) if they are applied to ontologies incorporating multiple taxonomical inheritance, several taxonomical paths are not taken into account, and (2) by considering only the paths (i.e., subsumers) between the concepts, much of the taxonomical knowledge explicitly modeled in the ontology is ignored.

Assuming that concepts in the ontology are linked with their ancestors through pointers, in the worst case (comparing the two most specific concepts in the ontology that have the root node as LCS), obtaining the \(LCS(c_1 ,c_2)\) requires running through the longest path in the taxonomy, i.e., twice the taxonomy depth D. Therefore, it takes O(D) cost to compute Expression (10).

A.2 Feature-based measures

Feature-based measures consider the degree of overlap between the sets of ontological features of the concepts to be compared. In [40], the authors suggested measuring the semantic distance as a function of taxonomic features, i.e., as the ratio between the number of non-common taxonomic ancestors and the total number of ancestors of the two concepts:

$$\begin{aligned} d_{\mathrm{log}\text {SC}} ( c_1 , c_2 ) = \log _{2} \left( 1 + \frac{|S(c_1)\cup S(c_2)|-|S(c_1)\cap S(c_2)|}{|S(c_1)\cup S(c_2)|}\right) , \end{aligned}$$
(11)

where \(S(c_i)\) is the set of taxonomic subsumers of the concept \(c_i\), for \(i = 1,2\). Due to the additional knowledge feature-based measures take into account (i.e., multiple direct ancestors in case of multiple inheritance), they tend to be more accurate than edge-counting measures [40].

If S is the maximum number of ancestors that a concept can have in the ontology, computing Expression (11) takes O(S) cost. Notice that, for ontologies without multiple inheritance, this cost is the same as that of edge-counting measures.
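Expression (11) can be sketched in a few lines of Python. The taxonomy below is a hypothetical example, and the sketch assumes that \(S(c_i)\) includes \(c_i\) itself, so that a concept and its direct parent are still told apart.

```python
import math

# Hypothetical toy taxonomy: each concept maps to its direct is-a parents.
PARENTS = {
    "diagnosis": [],
    "infection": ["diagnosis"],
    "respiratory": ["diagnosis"],
    "flu": ["infection", "respiratory"],
    "asthma": ["respiratory"],
}

def subsumers(c):
    """S(c): c itself (by assumption) plus all its taxonomic ancestors."""
    seen, stack = {c}, [c]
    while stack:
        for p in PARENTS[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def log_sc_distance(c1, c2):
    s1, s2 = subsumers(c1), subsumers(c2)
    union, common = s1 | s2, s1 & s2
    # Ratio of non-common subsumers to all subsumers, as in Expression (11).
    return math.log2(1 + (len(union) - len(common)) / len(union))

print(log_sc_distance("flu", "asthma"))
```

With these sets, "flu" and "asthma" have five subsumers in total and two in common, so the ratio is 3/5. Building each set visits every ancestor once, which matches the O(S) cost stated above.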

A.3 Measures based on information content

IC-based measures quantify the semantic distance between two concepts as the inverse of the amount of information they share in the ontology, which is represented by their LCS [35]. In particular, Lin [30] proposed as a measure the inverse of the ratio between the information content of the LCS of the concepts and the sum of the information content of each concept:

$$\begin{aligned} d_{\text {lin}}(c_1,c_2) =1- \frac{IC(LCS(c_1,c_2))}{IC(c_1)+IC(c_2)}. \end{aligned}$$
(12)

In [41], IC(c) is intrinsically estimated within the ontology as the normalized ratio between the number of leaves (i.e., terminal hyponyms) under concept c in the taxonomy and the number of subsumers of c:

$$\begin{aligned} IC(c) = - \log \left( \frac{\frac{|\text {leaves}(c)|}{|S(c)|}+1}{|\text {max\_leaves}+1|}\right) . \end{aligned}$$
(13)

Because IC-based measures exploit the largest amount of ontological evidence (i.e., ancestors and leaves), they achieve better accuracy than edge-counting and feature-based measures [6].

Expression (12) requires computing the LCS of the two concepts, plus the ICs of the LCS and the concepts. As in edge-counting measures, computing the LCS has a worst-case complexity O(D). On the other hand, Expression (13) requires obtaining all the concepts connected to c, either subsumers or hyponyms; hence, in the worst case (i.e., when c is the root node, which subsumes all the concepts in the ontology), the IC computation takes O(C) cost, where C is the total number of concepts in the taxonomy. In conclusion, Expression (12) has \(O(C+D)\) computational cost. Thus, IC-based measures are not only the most accurate but also the costliest.
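The following Python sketch evaluates Expressions (12) and (13) on a hypothetical toy taxonomy. Several choices are assumptions made for illustration: \(S(c)\) includes c itself, a leaf has no leaves under it, max_leaves is the number of leaves in the whole taxonomy, and the LCS is the common subsumer with the highest IC. The code follows Expression (12) literally; Lin's original similarity carries a factor 2 in the numerator, which is not applied here.

```python
import math

# Hypothetical toy taxonomy: each concept maps to its direct is-a parents.
PARENTS = {
    "diagnosis": [],
    "infection": ["diagnosis"],
    "respiratory": ["diagnosis"],
    "flu": ["infection", "respiratory"],
    "asthma": ["respiratory"],
}
CHILDREN = {c: [] for c in PARENTS}
for child, parents in PARENTS.items():
    for p in parents:
        CHILDREN[p].append(child)
MAX_LEAVES = sum(1 for c in PARENTS if not CHILDREN[c])

def reach(c, edges):
    """c plus every concept reachable from c by following `edges`."""
    seen, stack = {c}, [c]
    while stack:
        for nxt in edges[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def ic(c):
    """Intrinsic information content of c, as in Expression (13)."""
    leaves = [d for d in reach(c, CHILDREN) - {c} if not CHILDREN[d]]
    subsumers = reach(c, PARENTS)
    return -math.log((len(leaves) / len(subsumers) + 1) / (MAX_LEAVES + 1))

def lin_distance(c1, c2):
    """Expression (12) as written in the text."""
    total = ic(c1) + ic(c2)
    if total == 0:  # both concepts are the root, which carries no information
        return 0.0
    common = reach(c1, PARENTS) & reach(c2, PARENTS)
    lcs = max(common, key=ic)  # assumption: most informative common subsumer
    return 1 - ic(lcs) / total

print(lin_distance("flu", "asthma"))
```

Computing `ic` for the root walks the whole taxonomy, which is the O(C) worst case discussed above.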

B Security of the scalar product protocols used

B.1 Proof of Proposition 1

Charlie receives \(\varvec{r}'_x\) from Alice. But \(\varvec{r}'_x\) can be obtained as the difference between \(\hat{\varvec{x}}'+{\varvec{k}}\) and \({\varvec{x}}+{\varvec{k}}\), where \({\varvec{k}}\) is an n-vector with all its components set to k and k is any real number. Hence, Charlie learns nothing about \(\varvec{x}\). A similar argument shows that Charlie learns nothing about \(\varvec{y}\).

Bob receives \(\hat{\varvec{x}}'\) from Alice and \({\varvec{r}}_y\) from Charlie. Clearly, \({\varvec{r}}_y\) contains no information on \({\varvec{x}}\). On the other hand,

$$\begin{aligned} \hat{\varvec{x}}' = {{{\mathcal {P}}}}_x(\hat{\varvec{x}})= {{{\mathcal {P}}}}_x({\varvec{x}}+{\varvec{r}}_x). \end{aligned}$$

Since \({{{\mathcal {P}}}}_x\) is a random permutation, the probability of Bob’s learning \(\hat{\varvec{x}}\) from \(\hat{\varvec{x}}'\) is 1 over the number of permutations of \(\hat{\varvec{x}}\), that is

$$\begin{aligned} \frac{n^x_1! n^x_2! \ldots n^x_{d_x}!}{n!}, \end{aligned}$$

where \(d_x\) is the number of different values among the n values of \(\hat{\varvec{x}}\), and \(n^x_i\) is the number of repetitions of the ith different value. Since \(\hat{\varvec{x}}\) is the result of adding a random vector to \({\varvec{x}}\), it is highly unlikely that \(\hat{\varvec{x}}\) contains repeated values, so the probability of Bob’s learning \(\hat{\varvec{x}}\) is very low. Furthermore, Bob does not know \({\varvec{r}}_x\). Without knowledge of \(\hat{\varvec{x}}\) and \({\varvec{r}}_x\), Bob cannot learn \(\varvec{x}\).
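The probability above is easy to evaluate numerically. The sketch below is only an illustration with made-up vectors; the function name `guess_probability` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial, prod

def guess_probability(values):
    """1 over the number of distinct permutations of `values`,
    i.e. (n_1! * ... * n_d!) / n!."""
    repeats = Counter(values).values()
    return Fraction(prod(factorial(k) for k in repeats), factorial(len(values)))

print(guess_probability([1.5, 2.5, 3.5, 4.5]))  # all values distinct
print(guess_probability([1, 1, 2]))             # one value repeated twice
```

With four distinct values the probability is 1/24, and it decreases factorially with n when there are no repetitions, which is the case the proof argues is the likely one after adding a random vector.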

The argument on the inability of Alice to learn \(\varvec{y}\) is analogous.

B.2 On the security of Protocol 2

Protocol 2 is a variation of a protocol proposed in [23]. The latter protocol takes place only between Alice and Bob and there is no CLARUS proxy. Thus it differs from Protocol 2 in the last three steps, which are as follows:

  4. Bob generates a random plaintext \(s_B\), a random number \(r'\) and sends \(\omega ' = \omega Enc_{p_k}(-s_B;r')\) to Alice.

  5. Alice computes \(s_A = Dec_{s_k}(\omega ') = \varvec{x}^T \varvec{y} - s_B\).

  6. Alice and Bob simultaneously exchange the values \(s_A\) and \(s_B\), respectively, so that both can compute \(s_A + s_B = \varvec{x}^T \varvec{y}\).

The authors of [23] prove that, if Paillier’s cryptosystem is secure, Alice cannot learn \({\varvec{y}}\) and Bob cannot learn \({\varvec{x}}\) in their protocol.

The only modification introduced by Protocol 2 is that Alice and Bob do not share their results \(s_A\) and \(s_B\), but they send these values to CLARUS. Since neither Alice nor Bob has more information than in the protocol of [23], the security of the latter protocol is preserved in Protocol 2.
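The role of the shares \(s_A\) and \(s_B\) can be seen in a toy Python sketch built on a textbook Paillier cryptosystem. This is an illustration under strong assumptions, not the authors' implementation: the primes are far too small for real use, the generator is fixed to \(n+1\), and the first three steps (Alice encrypts the components of \(\varvec{x}\) and Bob combines them homomorphically into \(\omega\)) are assumed here because they are not listed above. All function and variable names are hypothetical, and the vectors are assumed to hold small non-negative integers.

```python
import math
import random

P, Q = 1009, 1013                  # toy primes, insecure by design
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # Alice's secret key
MU = pow(LAM, -1, N)               # valid because the generator is N + 1

def enc(m, r=None):
    """Paillier encryption of m (reduced mod N) with randomness r."""
    if r is None:
        r = random.randrange(1, N)
        while math.gcd(r, N) != 1:
            r = random.randrange(1, N)
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2

def dec(c):
    """Paillier decryption; returns a value in [0, N)."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def scalar_product_shares(x, y):
    # Assumed steps 1-3: Alice sends Enc(x_i); Bob raises each ciphertext
    # to y_i and multiplies, obtaining omega = Enc(x^T y).
    ciphertexts = [enc(xi) for xi in x]
    omega = 1
    for c, yi in zip(ciphertexts, y):
        omega = omega * pow(c, yi, N2) % N2
    # Step 4: Bob masks the result with a random plaintext s_B.
    s_b = random.randrange(N)
    omega_prime = omega * enc(-s_b) % N2
    # Step 5: Alice decrypts and obtains s_A = x^T y - s_B (mod N).
    s_a = dec(omega_prime)
    return s_a, s_b

s_a, s_b = scalar_product_shares([1, 0, 2, 3], [4, 5, 6, 7])
print((s_a + s_b) % N)  # whoever receives both shares recovers x^T y mod N
```

Each share on its own is uniformly distributed modulo N, so only a party holding both, such as CLARUS in Protocol 2, can recombine them into the scalar product.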

Cite this article

Domingo-Ferrer, J., Sánchez, D., Ricci, S. et al. Outsourcing analyses on privacy-protected multivariate categorical data stored in untrusted clouds. Knowl Inf Syst 62, 2301–2326 (2020). https://doi.org/10.1007/s10115-019-01424-4
