Privacy-preserving high-dimensional data publishing for classification
Introduction
Various organizations, such as government agencies and healthcare systems, often share the data they have collected (e.g., census data or medical records) with third parties for specific data analyses (Doshi, Jefferson, Mar, 2012, Janssen, Charalabidis, Zuiderwijk, 2012). However, directly releasing the raw data may violate the privacy of the individuals from whom it was obtained and expose them to risks such as stalking or fraud, because an adversary can combine the released data with external sources, such as voter lists, to re-identify individual records. Data anonymization is therefore an important way of protecting individual privacy before publication. Many privacy models (Li, Li, Venkatasubramanian, 2007, Loukides, Gkoulalas-Divanis, 2012, Machanavajjhala, Gehrke, Kifer, Venkitasubramaniam, 2006, Sun, Wang, Li, Zhang, 2012, Sweeney, 2002, Terrovitis, Mamoulis, Kalnis, 2008, Wang, Meng, Bamba, Liu, Pu, 2009, Zhu, Li, Zhou, Philip, 2017) have been proposed in the literature. Unfortunately, as dimensionality increases, most existing privacy models cannot handle high-dimensional data effectively (Aggarwal, 2007, Gabriel, Tao, Kalnis, 2008).
There are two reasons why existing privacy models are ineffective on high-dimensional data. First, high dimensionality severely degrades the utility of the anonymized data (Fung et al., 2012). As the number of dimensions increases, more attributes become available for adversarial attacks, because the adversary can associate more attributes with his or her prior knowledge to identify a target victim. Protecting individual privacy then requires a larger perturbation, leaving the anonymized data with unacceptably low utility. Second, it may be impossible to design any feasible solution that meets the pre-specified privacy requirements. It has been shown that some anonymization techniques, such as generalization-based approaches (Bredereck, Nichterlein, Niedermeier, Philip, 2014, He, Naughton, 2009, Loukides, Gkoulalas-Divanis, Shao, 2013), rely on the ambiguity between data points within a specified spatial locality to preserve privacy. However, because data points in a very high-dimensional space become sparse, the distances between them are less distinctive. The concept of spatial locality thus becomes ill-defined (Aggarwal, 2007), making it impractical for such anonymization models to preserve privacy.
Some privacy models have been proposed to address the problem of dimensionality. For example, Mohammed et al. (Mohammed et al., 2010) proposed the LKC-privacy model for anonymizing high-dimensional data. Based on the assumption that the adversary’s prior knowledge covers at most L attributes, LKC-privacy ensures that each combination of quasi-identifying values of length at most L is shared by at least K records. Instead of processing all potential quasi-identifying values, the model only deals with L values at each iteration. However, if the data publisher sets L to a large value to obtain a rigorous privacy guarantee, LKC-privacy still cannot handle high-dimensional data effectively. On the other hand, feature selection (Chandrashekar and Sahin, 2014) and feature transformation (Varghese et al., 2012) techniques are used extensively to address the challenges of high dimensionality in data mining tasks. However, these techniques may not be directly applicable to data publishing, because the data publisher cannot accurately predict which attributes the data recipient will need or how these attributes will be analyzed in the future. Feature selection may exclude many attributes from the raw data, yet the removed attributes may be essential to some later analyses; feature transformation usually produces transformed attributes of poor interpretability.
We set our data publishing scenario as follows. The data owner wants to release high-dimensional, person-specific data (similar to Table 1) to some recipients for classification. However, if a record in the raw data is so specific that few individuals can match it, releasing the data without explicit identifiers may still allow easy re-identification of the record. For example, assume that Alice knows that Bob’s information is in Table 1 and that he is in his twenties. Then the 8th record of Table 1 can be uniquely linked to Bob, because it is the only record of a male in his twenties in the table. Identifying his record, in turn, discloses his other information. The problem studied in this paper is to generate an anonymous version of the raw data that resists record linkage attacks while preserving classification accuracy.
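The re-identification above amounts to a simple filter over the released records. The sketch below illustrates this; the table, column names, and values are invented for illustration and are not the actual contents of Table 1:

```python
# Hypothetical released records (explicit identifiers removed); the
# "id" field stands in for the row position in the published table.
records = [
    {"id": 1, "sex": "Female", "age": 34, "disease": "Flu"},
    {"id": 8, "sex": "Male",   "age": 25, "disease": "HIV"},
    {"id": 9, "sex": "Male",   "age": 41, "disease": "Flu"},
]

# Alice's background knowledge about Bob: male, in his twenties.
matches = [r for r in records if r["sex"] == "Male" and 20 <= r["age"] <= 29]

if len(matches) == 1:
    # A unique match re-identifies Bob and discloses his sensitive value.
    print(matches[0]["disease"])
```

Because exactly one record satisfies Alice’s background knowledge, the sensitive attribute of that record is disclosed even though no explicit identifier was published.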
In this paper, we present an approach to anonymizing high-dimensional data in the context of k-anonymity and classification analysis. To address the challenge of high dimensionality faced by the k-anonymity model and the loss of information caused by feature reduction techniques, the proposed approach relies on vertical partitioning. Intuitively, the idea is to vertically divide the raw data into disjoint data subsets, each containing fewer attributes but the same number of records, and then to anonymize each subset separately. During the vertical partition phase, both the correlation between attributes and the proportion of quasi-identifying attributes in each subset are taken into account to maintain utility. Given that the optimal implementation of k-anonymity has been proven to be NP-hard (Meyerson and Williams, 2004), a heuristic method based on local recoding is proposed to mask each subset. Compared to the raw high-dimensional data, each subset of smaller dimensionality can be anonymized efficiently and effectively. After data recipients receive the released data, they can use all or some of the anonymized subsets to build classifiers and combine these classifiers for category prediction. Although the paper focuses on k-anonymity for simplicity, other privacy models, such as l-diversity (Machanavajjhala et al., 2006) and (ϵ, δ)k-dissimilarity (Wang et al., 2009), can easily be integrated into our work by using different search criteria during the local recoding phase.
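As a rough illustration of the vertical partition idea, the sketch below splits an attribute list into disjoint fragments in round-robin fashion. The attribute names are invented, and the paper’s actual partition additionally weighs attribute correlation and the share of quasi-identifying attributes per fragment, which this toy version omits:

```python
def vertical_partition(attrs, n_fragments):
    """Naive round-robin split of attributes into disjoint fragments.
    Every fragment keeps all records but only its own attributes, so the
    fragments can be anonymized (and later used for classification)
    independently."""
    fragments = [[] for _ in range(n_fragments)]
    for i, attr in enumerate(attrs):
        fragments[i % n_fragments].append(attr)
    return fragments

attrs = ["age", "sex", "job", "edu", "zip", "salary"]
print(vertical_partition(attrs, 3))
# [['age', 'edu'], ['sex', 'zip'], ['job', 'salary']]
```

Each fragment has lower dimensionality than the raw table, which is what makes the subsequent per-fragment k-anonymization tractable.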
The rest of the paper is organized as follows. Preliminaries are presented in Section 2, and related works are reviewed in Section 3. The proposed approach is described in Section 4, and the experimental results are analyzed in Section 5. Section 6 concludes the paper.
Section snippets
Preliminaries
For convenience of subsequent discussion, basic concepts and definitions used in this paper are introduced briefly, followed by the concept of local recoding.
Related works
The problem of data privacy disclosure has been studied extensively. We first briefly review some traditional privacy models. Subsequently, research on high dimensionality, data partitioning and local recoding in existing anonymization approaches is briefly surveyed.
To resist record linkage attacks, k-anonymity (Sweeney, 2002) requires that each record be indistinguishable from at least (k − 1) other records in terms of the QI attributes. To further prevent attribute linkage, l-diversity (Machanavajjhala et al., 2006)
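The k-anonymity condition can be checked directly by counting how often each combination of QI values occurs in the table. The function and toy records below are illustrative, not part of the paper:

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """Return True iff every combination of quasi-identifier values is
    shared by at least k records, i.e., each record is indistinguishable
    from at least k-1 others on the QI attributes."""
    counts = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(c >= k for c in counts.values())

# Toy anonymized records (invented for illustration).
data = [
    {"sex": "Male",   "age": "[20-30)", "disease": "Flu"},
    {"sex": "Male",   "age": "[20-30)", "disease": "HIV"},
    {"sex": "Female", "age": "[30-40)", "disease": "Flu"},
]
print(is_k_anonymous(data, ["sex", "age"], 2))  # False: the female record is unique
```

The third record violates 2-anonymity because its QI combination occurs only once, so an adversary who knows those QI values could link it to an individual.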
Proposed approach
Our proposed approach consists of two phases. First, the raw data are vertically divided into disjoint subsets (fragments), each containing a smaller number of attributes but the same number of records. Second, each fragment is anonymized independently to achieve k-anonymity. Through these phases, the proposed approach can alleviate the burden of high dimensionality and generate quality anonymized data that preserves classification accuracy. We next elaborate
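A minimal sketch of the generalization step in the second phase is shown below, assuming a cluster (equivalence class) of at least k records has already been formed by local recoding. The range-and-suppression recoding here is a crude stand-in for hierarchy-based generalization, and the sample values are invented:

```python
def generalize_cluster(cluster, qi_attrs):
    """Recode one equivalence class: replace each numeric QI value with
    the cluster's [min-max] range, and suppress ('*') each categorical
    QI that is not uniform across the cluster. Afterwards all records in
    the cluster share identical QI values."""
    out = []
    for r in cluster:
        g = dict(r)  # copy so the raw cluster is left untouched
        for a in qi_attrs:
            vals = [rec[a] for rec in cluster]
            if all(isinstance(v, (int, float)) for v in vals):
                g[a] = f"[{min(vals)}-{max(vals)}]"
            elif len(set(vals)) > 1:
                g[a] = "*"
        out.append(g)
    return out

cluster = [{"age": 23, "sex": "Male"}, {"age": 28, "sex": "Female"}]
print(generalize_cluster(cluster, ["age", "sex"]))
# [{'age': '[23-28]', 'sex': '*'}, {'age': '[23-28]', 'sex': '*'}]
```

After generalization the two records are indistinguishable on their QI attributes, so a cluster of size k yields a k-anonymous equivalence class.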
Experimental evaluation
In this section, we evaluate the effectiveness and efficiency of our proposed approach. First, we validate the performance of our approach by tuning different numbers of fragments and anonymity levels. Second, we compare our approach with two other anonymization methods, i.e., LKC-privacy (Mohammed et al., 2010) and δ-selective (Zakerzadeh et al., 2016). Third, we evaluate the scalability of our approach. The experiments were performed on a PC with a 3.4 GHz Intel Core i7 CPU and 16 GB of RAM
Conclusions
An approach to anonymizing high-dimensional data was proposed in this paper. The quasi-identifying attributes were partitioned into disjoint subsets to alleviate the burden of high dimensionality. Both the predictability of the class values from the attributes and the redundancy among these attributes were taken into account in the partition phase. For each subset, equivalence classes were generated by a cluster-based local recoding method and then generalized to achieve anonymity.
Declaration of Competing Interest
None.
Acknowledgments
The research was supported by the Sichuan Province Science and Technology Program (2019YFSY0032) and China Scholarship Council (201807000083). We also would like to thank the editors and anonymous reviewers for their helpful comments that have led to an improved version of this paper.
Rong Wang is now a Ph.D. candidate in the School of Information Science and Technology at Southwest Jiaotong University, Chengdu. Her current research interests include machine learning, data mining with big data, and privacy preserving data mining.
References (34)
- et al., A survey on feature selection methods, Comput. Electr. Eng. (2014)
- et al., Utility-preserving transaction data anonymization with low information loss, Expert Syst. Appl. (2012)
- On randomization, public information and the curse of dimensionality, Proceedings of the IEEE 23rd International Conference on Data Engineering (2007)
- et al., The effect of homogeneity on the computational complexity of combinatorial data anonymization, Data Min. Knowl. Discov. (2014)
- et al., Efficient k-anonymization using clustering techniques, Proceedings of the International Conference on Database Systems for Advanced Applications (2007)
- et al., SABRE: a sensitive attribute bucketization and redistribution framework for t-closeness, The VLDB Journal (2011)
- et al., Anonymization in the time of big data, Privacy in Statistical Databases (2016)
- et al., The imperative to share clinical study reports: recommendations from the Tamiflu experience, PLoS Med. (2012)
- et al., Normalized mutual information feature selection, IEEE Trans. Neural Networks (2009)
- et al., Transmission of information: a statistical theory of communications, Am. J. Phys. (1961)
- Service-oriented architecture for high-dimensional private data mashup, IEEE Trans. Serv. Comput.
- On the anonymization of sparse high-dimensional data, Proceedings of the IEEE 24th International Conference on Data Engineering
- Data mining: concepts and techniques
- Anonymization of set-valued data via top-down, local generalization, Proceedings of the VLDB Endowment
- Benefits, adoption barriers and myths of open data and open government, Information Systems Management
- t-closeness: privacy beyond k-anonymity and l-diversity, Proceedings of the 23rd International Conference on Data Engineering
- Closeness: a new privacy measure for data publishing, IEEE Trans. Knowl. Data Eng.
Yan Zhu received her Ph.D. degree in computer science from Darmstadt University of Technology (TU Darmstadt), Darmstadt, in 2004. She was also a research staff member at the Department of Computer Science of TU Darmstadt from 1998 to 2004. She is now a professor in the School of Information Science and Technology at Southwest Jiaotong University, Chengdu. She has published two academic books in English and Chinese, and about 60 journal/conference papers. Her current research interests include web data mining, privacy preserving data mining, and big data management and analysis.
Chin-Chen Chang received his B.S. degree in applied mathematics in 1977 and his M.S. degree in computer and decision sciences in 1979 from the National Tsinghua University, Hsinchu, and his Ph.D. degree in computer engineering in 1982 from the National Chiao Tung University, Hsinchu. He is now a chair professor at the Department of Information Engineering and Computer Science, Feng Chia University, Taichung. He is also a fellow of IEEE and a fellow of IEE, United Kingdom. His current research interests include computer cryptography and information security, cloud computing, data engineering, and database systems.
Qiang Peng received his B.E. degree in automation control from Xi’an Jiaotong University, Xi’an, China, his M.Eng. in computer application and technology and his Ph.D. degree in traffic information and control engineering from Southwest Jiaotong University, Chengdu, China, in 1984, 1987, and 2004, respectively. He is currently a professor at the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. He has published over 80 papers and holds 10 Chinese patents. His research interests include digital video compression and transmission, image/graphics processing, traffic information detection and simulation, virtual reality technology, multimedia systems and data mining.