Abstract
The curse of dimensionality has remained a challenge for a wide variety of algorithms in data mining, clustering, classification, and privacy. Recently, it was shown that increasing dimensionality makes data resistant to effective privacy preservation. These theoretical results seem to suggest that the dimensionality curse is a fundamental barrier to privacy preservation. In practice, however, we show that some common properties of real data can be leveraged to greatly ameliorate the negative effects of the curse of dimensionality. In real data sets, many dimensions exhibit high levels of inter-attribute correlation. Such correlations enable a process known as vertical fragmentation, which decomposes the data into vertical subsets of smaller dimensionality. An information-theoretic criterion based on mutual information guides the vertical decomposition. This, in turn, enables an anonymization process that combines the results from multiple independent fragments. We present a general approach that can be applied to the k-anonymity, \(\ell \)-diversity, and t-closeness models. In the presence of inter-attribute correlations, this approach remains much more robust at higher dimensionality without losing accuracy. We present experimental results illustrating the effectiveness of the approach, which is resilient enough to prevent identity, attribute, and membership disclosure attacks.
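To make the role of the mutual-information criterion concrete, the following minimal Python sketch groups categorical attributes into vertical fragments by greedily merging columns that share high empirical mutual information. The mi_threshold parameter, the greedy placement rule, and the function names are illustrative assumptions for exposition only; they are not the decomposition algorithm developed in the paper.

import math
from collections import Counter

def mutual_information(xs, ys):
    # Empirical mutual information (in nats) between two categorical columns.
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def vertical_fragments(rows, mi_threshold=0.1):
    # Greedy grouping: an attribute joins an existing fragment if it shares at
    # least mi_threshold mutual information with some member of that fragment;
    # otherwise it starts a new fragment of its own.
    columns = list(zip(*rows))          # column-major view of the table
    fragments = []                      # each fragment is a list of column indices
    for j, col in enumerate(columns):
        for frag in fragments:
            if any(mutual_information(col, columns[k]) >= mi_threshold for k in frag):
                frag.append(j)
                break
        else:
            fragments.append([j])
    return fragments

# Toy table: attributes 0 and 1 are perfectly correlated, attribute 2 is independent.
rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"), ("b", "y", "q")]
print(vertical_fragments(rows))         # e.g. [[0, 1], [2]]

Each resulting fragment would then be anonymized independently, and the anonymized fragments combined, in the spirit of the fragmentation-based approach summarized above.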









Notes
J48 is an open source implementation of C4.5 in Java, http://weka.sourceforge.net/doc/weka/classifiers/trees/J48.html.
Recall that fragmentation is the first step of the k-anonymity algorithm and the third step of the \(\ell \)-diversity algorithm.
Here, the power of an attacker refers to the number of quasi-identifier attributes that he or she is aware of.
The work in [12] only considers the \(\ell \)-diversity model, so it is not suitable for comparison with our work.
There are 7 classes; however, instances of only 6 classes appear in this training sample.
This data set has only 36 features.
We adopt distinct \(\ell \)-diversity in our experiments.
Note that this metric is independent of the anonymity degree (k or \(\ell \)) because it is calculated before the anonymization algorithm is applied to the data.
References
Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: SIGMOD
Aggarwal CC (2005) On \(k\)-anonymity and the curse of dimensionality. In: VLDB
Aggarwal CC (2007) On randomization, public information, and the curse of dimensionality. In: ICDE
Aggarwal CC (2008) Privacy and the dimensionality curse. In: Aggarwal C, Yu PS (eds) Privacy preserving data mining: models and algorithms. Springer, Berlin
Agrawal S, Haritsa J (2005) A framework for high accuracy privacy-preserving data mining. In: ICDE
Aggarwal CC, Yu PS (2008) Privacy preserving data mining: models and algorithms. Springer, Berlin
Chow C-Y, Mokbel MF (2011) Trajectory privacy in location-based services and data publication. ACM SIGKDD Explor Newsl 13(1):19–29
Ciriani V, De Capitani di Vimercati S, Foresti S, Jajodia S, Paraboschi S, Samarati P (2010) Combining fragmentation and encryption to protect privacy in data storage. ACM Trans Inf Syst Secur 13(3):1–33
Ding C, Peng H (2003) Minimum redundancy feature selection from microarray gene expression data. In: CSB
Ganapathy V, Thomas D, Feder T, Garcia-Molina H, Motwani R (2011) Distributing data for secure database services. In: PAIS workshop
Ghinita G, Karras P, Kalnis P, Mamoulis N (2007) Fast data anonymization with low information loss. In: VLDB
Ghinita G, Tao Y, Kalnis P (2008) On the anonymization of sparse high-dimensional data. In: ICDE
Iyengar V (2002) Transforming data to satisfy privacy constraints. In: KDD
Kifer D (2009) Attacks on privacy and deFinetti’s theorem. In: SIGMOD
LeFevre K, DeWitt DJ, Ramakrishnan R (2006) Mondrian multidimensional \(k\)-anonymity. In: ICDE
LeFevre K, DeWitt DJ, Ramakrishnan R (2006) Workload-aware anonymization. In: KDD
LeFevre K, DeWitt D, Ramakrishnan R (2005) Incognito: efficient full-domain \(k\)-anonymity. In: SIGMOD
Li F, Sun J, Papadimitriou S, Mihaila G, Stanoi I (2007) Hiding in the crowd: privacy preservation on evolving streams through correlation tracking. In: ICDE
Li N, Li T, Venkatasubramanian S (2007) \(t\)-closeness: privacy beyond \(k\)-anonymity and \(\ell \)-diversity. In: ICDE
Li T, Li N, Zhang J, Molloy I (2012) Slicing: a new approach for privacy preserving data publishing. IEEE Trans Knowl Data Eng 24(3):561–574
Liu H, Motoda H (2007) Computational methods for feature selection. Chapman and Hall/CRC, London (data mining and knowledge discovery series)
Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M (2006) \(\ell \)-diversity: privacy beyond \(k\)-anonymity. In: ICDE
Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Samarati P (2001) Protecting respondents' identities in microdata release. IEEE Trans Knowl Data Eng 13(6):1010–1027
Vaidya J, Clifton C (2002) Privacy-preserving association rule mining in vertically partitioned data. In: KDD
Xiao X, Tao Y (2006) Anatomy: simple and effective privacy preservation. In: VLDB
Wong W, Mamoulis N, Cheung D (2010) Non-homogeneous generalization in privacy preserving data publishing. In: SIGMOD
Xue M, Karras P, Raissi C, Vaidya J, Tan K (2012) Anonymizing set-valued data by nonreciprocal recoding. In: KDD
Terrovitis M, Mamoulis N, Kalnis P (2008) Privacy-preserving anonymization of set-valued data. In: VLDB
Xu Y, Wang K, Fu AW, Yu PS (2008) Anonymizing transaction databases for publication. In: KDD
Mohammed N, Fung B, Hung P, Lee C (2009) Anonymizing healthcare data: a case study on the blood transfusion service. In: KDD
Nergiz ME, Atzori M, Clifton C (2007) Hiding the presence of individuals from shared databases. In: SIGMOD
Zakerzadeh H, Aggarwal CC, Barker K (2014) Towards breaking the curse of dimensionality for high-dimensional privacy. In: SDM
Kifer D, Gehrke J (2006) Injecting utility into anonymized datasets. In: SIGMOD
Mohammed N, Fung B, Hung P, Lee C (2010) Centralized and distributed anonymization for high-dimensional healthcare data. ACM Trans Knowl Discov Data 4(4)
Terrovitis M, Liagouris J, Mamoulis N, Skiadopoulos S (2012) Privacy preservation by disassociation. In: VLDB
Dwork C (2006) Differential privacy. In: ICALP
Dwork C, McSherry F, Nissim K, Smith A (2006) Calibrating noise to sensitivity in private data analysis. In: TCC
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(6):417–441
He Y, Naughton J (2009) Anonymization of set-valued data via top-down, local generalization. In: VLDB
Zakerzadeh H, Osborn SL (2013) Delay-sensitive approaches for anonymizing numerical streaming data. Int J Inf Sec 12(5):423–437
Cao J, Karras P (2012) Publishing microdata with a robust privacy guarantee. In: VLDB
Cite this article
Zakerzadeh, H., Aggarwal, C.C. & Barker, K. Managing dimensionality in data privacy anonymization. Knowl Inf Syst 49, 341–373 (2016). https://doi.org/10.1007/s10115-015-0906-8