Privacy-preserving high-dimensional data publishing for classification

https://doi.org/10.1016/j.cose.2020.101785

Abstract

With increasing amounts of personal information being collected by various organizations, many privacy models have been proposed for masking collected data so that it can be published without compromising individual privacy. However, most existing privacy models are not applicable to high-dimensional data because of the sparseness of the high-dimensional search space. In this paper, we present a solution for releasing high-dimensional data that preserves privacy while supporting classification analysis. The challenge is to reduce the high dimensionality from the perspective of privacy models while preserving as much information as possible for classification. Our approach tackles this through vertical partition: the raw data are vertically divided into disjoint subsets of smaller dimensionality. Specifically, our partition metric considers both the correlation between attributes and the proportion of quasi-identifying attributes in each subset. A generalization method based on local recoding is then applied to each subset separately to achieve k-anonymity. Because the optimal implementation of k-anonymity is computationally hard, the local recoding method finds a near-optimal solution with the goal of improving efficiency. The proposed approach was evaluated on two datasets, and the experimental results show that it outperforms two related approaches in data utility at the same privacy level.

Introduction

Various organizations, such as government agencies and healthcare systems, often share the data they have collected (e.g., census data or medical records) with third parties for specific data analyses (Doshi, Jefferson, Mar, 2012, Janssen, Charalabidis, Zuiderwijk, 2012). However, directly releasing raw data may violate the privacy of the individuals from whom the data were obtained and expose them to risks such as stalking or fraud. This is because an adversary can combine the released data with external sources, such as voter lists, to re-identify individual records. Data anonymization is an important way of protecting individual privacy before publication. Many privacy models (Li, Li, Venkatasubramanian, 2007, Loukides, Gkoulalas-Divanis, 2012, Machanavajjhala, Gehrke, Kifer, Venkitasubramaniam, 2006, Sun, Wang, Li, Zhang, 2012, Sweeney, 2002, Terrovitis, Mamoulis, Kalnis, 2008, Wang, Meng, Bamba, Liu, Pu, 2009, Zhu, Li, Zhou, Philip, 2017) have been proposed in the literature. Unfortunately, as dimensionality increases, most existing privacy models cannot handle high-dimensional data effectively (Aggarwal, 2007, Gabriel, Tao, Kalnis, 2008).

There are two reasons why existing privacy models are ineffective on high-dimensional data. First, high dimensionality causes a severe degradation in the utility of the anonymized data (Fung et al., 2012). As the number of dimensions grows, more attributes become available for adversarial attacks, because the adversary can associate more attributes with his or her prior knowledge to identify a target victim. Protecting individual privacy then requires a larger perturbation, resulting in anonymized data of unacceptably low utility. Second, it may not even be possible to design a feasible solution that meets a pre-specified privacy requirement. It has been shown that some anonymization techniques, such as generalization-based approaches (Bredereck, Nichterlein, Niedermeier, Philip, 2014, He, Naughton, 2009, Loukides, Gkoulalas-Divanis, Shao, 2013), rely on the ambiguity between different data points within a specified spatial locality to preserve privacy. However, data points in a very high-dimensional space become sparse, and the distances between them become less distinctive. The concept of spatial locality thus becomes ill-defined (Aggarwal, 2007), making it impractical for such anonymization models to preserve privacy.

Some privacy models have been proposed to address the problem of dimensionality. For example, Mohammed et al. (Mohammed et al., 2010) proposed the LKC-privacy model for anonymizing high-dimensional data. Based on the assumption that the adversary's prior knowledge covers at most L attributes, LKC-privacy ensures that each combination of quasi-identifying values with maximum length L is shared by at least K records. Instead of processing all potential quasi-identifying values, the model only deals with L values at each iteration. However, if the data publisher sets L to a large number to obtain a rigorous privacy guarantee, LKC-privacy cannot essentially deal with high-dimensional data. On the other hand, techniques of feature selection (Chandrashekar and Sahin, 2014) and feature transformation (Varghese et al., 2012) are used extensively to address the challenges of high dimensionality in data mining tasks. However, these techniques may not be directly applicable to data publishing scenarios, because the data publisher cannot accurately predict which attributes the data recipient will need or how these attributes will be analyzed in the future. Feature selection techniques may exclude many attributes from the raw data, yet the removed attributes may contribute substantially to some specific analyses; feature transformation techniques usually produce transformed attributes with poor interpretability.
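As a concrete illustration of the LKC-privacy condition just described, the sketch below checks whether every combination of at most L quasi-identifier values is shared by at least K records (the confidence bound on sensitive values, the C in LKC, is omitted). The data and attribute names are invented for illustration, not taken from the paper.

```python
from itertools import combinations
from collections import Counter

def satisfies_lkc(records, qi_attrs, L, K):
    """Check the record-linkage part of LKC-privacy: every combination of
    at most L quasi-identifier values must be shared by >= K records."""
    for size in range(1, min(L, len(qi_attrs)) + 1):
        for attrs in combinations(qi_attrs, size):
            counts = Counter(tuple(r[a] for a in attrs) for r in records)
            if any(c < K for c in counts.values()):
                return False
    return True

records = [
    {"Sex": "M", "Age": "30-39"},
    {"Sex": "M", "Age": "30-39"},
    {"Sex": "F", "Age": "30-39"},
    {"Sex": "F", "Age": "30-39"},
]
print(satisfies_lkc(records, ["Sex", "Age"], L=2, K=2))  # True
```

Note that the loop enumerates all size-≤L attribute combinations, which is exactly why a large L (demanded by a rigorous privacy guarantee) makes the check, and the accompanying anonymization, expensive on high-dimensional data.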

We set our data publishing scenario as follows. The data owner wants to release high-dimensional, person-specific data (similar to Table 1) to some recipients for classification. However, if a record in the raw data is so specific that few individuals can match it, releasing the data without explicit identifiers may still allow easy re-identification of the record. For example, assume that Alice knows that Bob's information is in Table 1 and that he is in his twenties. Then the 8th record of Table 1 can be uniquely linked to Bob, because it is the only record of a male in his twenties in the table, and identifying his record discloses his other information. The problem studied in this paper is to generate an anonymous version of the raw data that resists record linkage attacks while preserving classification accuracy.
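The re-identification risk in this scenario can be made concrete by counting equivalence classes over the attributes an adversary may know: a class of size 1 is uniquely linkable. The toy table below merely mimics the Bob example (a single male in his twenties); it is not the paper's actual Table 1.

```python
from collections import Counter

def equivalence_classes(records, known_attrs):
    """Group records by the attribute values an adversary might know;
    a class of size 1 means its record can be uniquely re-identified."""
    return Counter(tuple(r[a] for a in known_attrs) for r in records)

table = [
    {"Sex": "M", "Age": 27, "Disease": "Flu"},     # Bob-like: unique
    {"Sex": "M", "Age": 41, "Disease": "Cancer"},
    {"Sex": "F", "Age": 25, "Disease": "Flu"},
    {"Sex": "F", "Age": 29, "Disease": "HIV"},
]
classes = equivalence_classes(
    [{"Sex": r["Sex"], "Decade": r["Age"] // 10} for r in table],
    ["Sex", "Decade"],
)
print(classes[("M", 2)])  # 1 -> the record links uniquely to Bob
```

k-anonymity demands that every such class have size at least k, which is precisely what the generalization phase of the paper enforces.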

In this paper, we present an approach to anonymizing high-dimensional data in the context of k-anonymity and classification analysis. To address the challenge of high dimensionality faced by the k-anonymity model and the loss of information caused by feature reduction techniques, the proposed approach uses the idea of vertical partition. Intuitively, the idea is to vertically divide the raw data into disjoint data subsets, each containing fewer attributes but the same number of records, and then to anonymize each subset separately. During the vertical partition phase, both the correlation between attributes and the proportion of quasi-identifying attributes in each subset are taken into account to maintain utility. Given that the optimal implementation of k-anonymity has been proven to be NP-hard (Meyerson and Williams, 2004), a heuristic method based on local recoding is proposed to mask each subset. Compared to the raw high-dimensional data, each subset of smaller dimensionality can be anonymized efficiently and effectively. After data recipients receive the released data, they can use all or part of the anonymized subsets to build classifiers and combine these classifiers for category predictions. Although the paper focuses on k-anonymity for simplicity, other privacy models, such as l-diversity (Machanavajjhala et al., 2006) and (ϵ, δ)k-dissimilarity (Wang et al., 2009), can easily be integrated into our work by using different search criteria during the local recoding phase.
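As an illustrative sketch of the vertical-partition idea, the code below greedily assigns each attribute to the subset whose current members it is most correlated with, using empirical mutual information as a stand-in for the paper's partition metric (which additionally weighs the proportion of quasi-identifying attributes per subset). The greedy strategy, capacity rule, and data are assumptions for the sake of the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def greedy_partition(columns, num_subsets):
    """Greedily place each attribute into the (non-full) subset whose
    members it is most correlated with, keeping subset sizes balanced."""
    subsets = [[] for _ in range(num_subsets)]
    cap = math.ceil(len(columns) / num_subsets)
    for name in columns:
        best = max(
            (s for s in subsets if len(s) < cap),
            key=lambda s: sum(mutual_information(columns[name], columns[m])
                              for m in s),
        )
        best.append(name)
    return subsets

cols = {
    "Age":    [2, 2, 3, 3, 4, 4],
    "Decade": [2, 2, 3, 3, 4, 4],  # perfectly correlated with Age
    "Sex":    [0, 1, 0, 1, 0, 1],
}
print(greedy_partition(cols, 2))  # [['Age', 'Decade'], ['Sex']]
```

Grouping correlated attributes together means that the joint information they carry survives per-fragment generalization, which is the utility rationale behind the partition metric.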

The rest of the paper is organized as follows. Preliminaries are presented in Section 2, and related works are reviewed in Section 3. The proposed approach is described in Section 4, and the experimental results are analyzed in Section 5. Section 6 concludes the paper.

Section snippets

Preliminaries

For convenience of subsequent discussion, basic concepts and definitions used in this paper are introduced briefly, followed by the concept of local recoding.

Related works

The problem of data privacy disclosure has been studied extensively. We briefly review some traditional privacy models, and then survey research on high dimensionality, data partition, and local recoding in existing anonymization approaches.

To resist record linkage attacks, k-anonymity (Sweeney, 2002) requires that each record be indistinguishable from at least (k−1) other records with respect to the QI attributes. To further prevent attribute linkage, l-diversity (Machanavajjhala et al., 2006) requires that the sensitive values within each equivalence class be well represented.

Proposed approach

Our proposed approach consists of two phases. First, the raw data are vertically divided into disjoint subsets/fragments, each containing a smaller number of attributes but the same number of records. Second, each fragment is anonymized independently to achieve k-anonymity. Through these phases, the proposed approach alleviates the burden of high dimensionality and generates quality anonymized data that preserve classification accuracy. We next elaborate on each phase.
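A minimal sketch of the second phase, assuming a single numeric quasi-identifier per fragment: records are greedily grouped into classes of at least k, and each value is generalized to its class's range. The paper's cluster-based local recoding is more elaborate; this only conveys the mechanics under that simplifying assumption.

```python
def local_recode(records, k):
    """Greedy local recoding for one numeric QI column: sort, cut into
    consecutive groups of >= k records, and generalize each value to its
    group's [min, max] range, yielding equivalence classes of size >= k."""
    order = sorted(range(len(records)), key=lambda i: records[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:  # merge an undersized tail
        groups[-2].extend(groups.pop())
    anonymized = [None] * len(records)
    for g in groups:
        lo, hi = records[g[0]], records[g[-1]]
        for i in g:
            anonymized[i] = f"[{lo}-{hi}]"
    return anonymized

ages = [23, 25, 27, 41, 43, 45, 60]
print(local_recode(ages, k=3))
```

Because each group is generalized to its own local range rather than to a single global hierarchy level, nearby values stay in tight intervals, which is the utility advantage of local over global recoding.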

Experimental evaluation

In this section, we evaluate the effectiveness and efficiency of our proposed approach. First, we validate its performance under different numbers of fragments and anonymity levels. Second, we compare it with two other anonymization methods, i.e., LKC-privacy (Mohammed et al., 2010) and δ-selective (Zakerzadeh et al., 2016). Third, we evaluate its scalability. The experiments were performed on a PC with a 3.4 GHz Intel Core i7 CPU and 16 GB of RAM.

Conclusions

An approach to anonymizing high-dimensional data was proposed in this paper. The quasi-identifying attributes were partitioned into disjoint subsets to alleviate the burden of high dimensionality. Both the predictability of the class values from the attributes and the redundancy among these attributes were taken into account in the partition phase. For each subset, equivalence classes were generated by a cluster-based local recoding method and then generalized to achieve anonymity.

Declaration of Competing Interest

None.

Acknowledgments

The research was supported by the Sichuan Province Science and Technology Program (2019YFSY0032) and the China Scholarship Council (201807000083). We would also like to thank the editors and anonymous reviewers for their helpful comments, which have led to an improved version of this paper.

Rong Wang is now a Ph.D. candidate in the School of Information Science and Technology at Southwest Jiaotong University, Chengdu. Her current research interests include machine learning, data mining with big data, and privacy preserving data mining.

References (34)

  • G. Chandrashekar et al., A survey on feature selection methods, Comput. Electr. Eng. (2014)
  • G. Loukides et al., Utility-preserving transaction data anonymization with low information loss, Expert Syst. Appl. (2012)
  • C.C. Aggarwal, On randomization, public information and the curse of dimensionality, Proceedings of IEEE 23rd International Conference on Data Engineering (2007)
  • R. Bredereck et al., The effect of homogeneity on the computational complexity of combinatorial data anonymization, Data Min. Knowl. Discov. (2014)
  • J.-W. Byun et al., Efficient k-anonymization using clustering techniques, Proceedings of International Conference on Database Systems for Advanced Applications (2007)
  • J. Cao et al., SABRE: a sensitive attribute bucketization and redistribution framework for t-closeness, VLDB J. (2011)
  • J. Domingo-Ferrer et al., Anonymization in the time of big data, Privacy in Statistical Databases (2016)
  • P. Doshi et al., The imperative to share clinical study reports: recommendations from the Tamiflu experience, PLoS Med. (2012)
  • P.A. Estévez et al., Normalized mutual information feature selection, IEEE Trans. Neural Netw. (2009)
  • R.M. Fano et al., Transmission of Information: A Statistical Theory of Communications, Am. J. Phys. (1961)
  • B.C. Fung et al., Service-oriented architecture for high-dimensional private data mashup, IEEE Trans. Serv. Comput. (2012)
  • G. Gabriel et al., On the anonymization of sparse high-dimensional data, Proceedings of IEEE 24th International Conference on Data Engineering (2008)
  • J. Han et al., Data Mining: Concepts and Techniques (2011)
  • Y. He et al., Anonymization of set-valued data via top-down, local generalization, Proceedings of the VLDB Endowment (2009)
  • M. Janssen et al., Benefits, adoption barriers and myths of open data and open government, Inf. Syst. Manag. (2012)
  • N. Li et al., t-Closeness: privacy beyond k-anonymity and l-diversity, Proceedings of the 23rd International Conference on Data Engineering (2007)
  • N. Li et al., Closeness: a new privacy measure for data publishing, IEEE Trans. Knowl. Data Eng. (2010)


Yan Zhu received her Ph.D. degree in computer science from Darmstadt University of Technology (TU Darmstadt), Darmstadt, in 2004. She was also a research staff member at the Department of Computer Science of TU Darmstadt from 1998 to 2004. She is now a professor in the School of Information Science and Technology at Southwest Jiaotong University, Chengdu. She has published two academic books in English and Chinese, and about 60 journal/conference papers. Her current research interests include web data mining, privacy preserving data mining, and big data management and analysis.

Chin-Chen Chang received his B.S. degree in applied mathematics in 1977 and his M.S. degree in computer and decision sciences in 1979 from the National Tsinghua University, Hsinchu, and his Ph.D. degree in computer engineering in 1982 from the National Chiao Tung University, Hsinchu. He is now a chair professor at the Department of Information Engineering and Computer Science, Feng Chia University, Taichung. He is also a fellow of IEEE and a fellow of IEE, United Kingdom. His current research interests include computer cryptography and information security, cloud computing, data engineering, and database systems.

Qiang Peng received his B.E. degree in automation control from Xi'an Jiaotong University, Xi'an, China, his M.Eng. in computer application and technology, and his Ph.D. degree in traffic information and control engineering from Southwest Jiaotong University, Chengdu, China, in 1984, 1987, and 2004, respectively. He is currently a professor at the School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China. He has published over 80 papers and holds 10 Chinese patents. His research interests include digital video compression and transmission, image/graphics processing, traffic information detection and simulation, virtual reality technology, multimedia systems, and data mining.
