Elsevier

Decision Support Systems

Volume 48, Issue 1, December 2009, Pages 133-140
Identity disclosure protection: A data reconstruction approach for privacy-preserving data mining

https://doi.org/10.1016/j.dss.2009.07.003

Abstract

Identity disclosure is one of the most serious privacy concerns in today's information age. A well-known method for protecting against identity disclosure is k-anonymity. A dataset provides k-anonymity protection if the information for each individual in the dataset cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the dataset. However, k-anonymity has a flaw that can still allow an intruder to discern the confidential information of individuals in the anonymized data. To overcome this problem, we propose a data reconstruction approach to achieve k-anonymity protection in predictive data mining. In this approach, the potentially identifying attributes are first masked using aggregation (for numeric data) and swapping (for nominal data). A genetic algorithm technique is then applied to the masked data to find a good subset of it. This subset is then replicated to form the released dataset that satisfies the k-anonymity constraint.

Introduction

Data-mining technologies have enabled organizations to extract useful knowledge from data in order to better understand and serve their customers, and to gain competitive advantages [6], [21], [26]. While successful business applications of data mining are encouraging, there are increasing concerns about invasions of the privacy of personal information. A survey by Time/CNN [16] revealed that 93% of respondents believed companies selling personal data should be required to gain permission from the individuals whose information is being shared. In another study [9], more than 70% of participants responded negatively to questions related to the secondary use of private information. Concern about privacy threats has caused data quality and integrity to deteriorate. According to [34], 82% of online users have refused to give personal information and 34% have lied when asked about their personal habits and preferences.

This study deals with the conflict between privacy and data mining in organizational decision support. Organizations that use their customers' records in data-mining activities are obligated to take actions to protect the identities of the individuals involved. It has been demonstrated that personal identities cannot be adequately protected by simply removing identity attributes from released data. There has been extensive research in the area of statistical databases (SDBs) on how to protect individuals' sensitive data when providing summary statistical information. The privacy issue arises in SDBs when summary statistics are derived on very few individuals' data. In this case, releasing the summary statistics may result in disclosing confidential data. The methods for preventing such disclosure can be broadly classified into two categories: (i) query restriction, which prohibits queries that would reveal confidential data, and (ii) data perturbation, which alters individual data in a way such that the summary statistics remain approximately the same. In general, both methods have been extensively investigated and employed [1].

Problems in data mining are somewhat different from those in SDBs. A data-mining task, such as classification or numeric prediction, requires working on individual records contained in a dataset. As a result, query restriction is no longer applicable and data perturbation or anonymization becomes the primary approach for privacy protection in data mining. Further, predictive data mining essentially relies on discovering relationships between data attributes. Preserving such relationships may not be consistent with preserving summary statistics. Researchers in the data-mining community have proposed various methods to resolve the conflict between data mining and privacy protection [4], [7], [14], [22], [23]. For example, a method for building a decision tree classifier from perturbed data is proposed in [3]. A framework for mining association rules from transaction data that have been randomized is presented in [11]. A set of algorithms for hiding sensitive rules is proposed in [36]. Techniques for preserving privacy in distributed data mining are discussed in [8].

A well-known method for privacy protection, called k-anonymity, was proposed in [31], [33]. The basic idea is to anonymize the data such that each individual cannot be distinguished from a group of other individuals in the data. The method has gained increasing popularity in privacy-preserving data mining. However, the k-anonymity approach would, in some circumstances, still allow a data intruder to discern individuals' confidential information in the k-anonymized data. To overcome this problem, we propose a data reconstruction approach to achieve k-anonymity protection in predictive data mining. In this approach, the potentially identifying attributes are first masked using aggregation (for numeric data) and swapping (for nominal data), without considering the k-anonymity constraint. A genetic algorithm technique is then applied to the masked data to find a good subset of it. This subset is then replicated to form the released dataset that satisfies the k-anonymity constraint. An experimental study is conducted to show the effectiveness of the proposed method.
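The k-anonymity condition and the final replication step can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the record layout, the attribute names, and the helper names `is_k_anonymous` and `replicate` are hypothetical, and the subset is hard-coded here rather than selected by a genetic algorithm.

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """True if every combination of QI values occurs at least k times."""
    counts = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(c >= k for c in counts.values())

def replicate(subset, k):
    """Copy each record of the subset k times to form the released data."""
    return [dict(r) for r in subset for _ in range(k)]

# Hypothetical masked records; attribute names and values are assumptions.
masked = [
    {"age": "20-39", "marital": "Single", "test": "Neg"},
    {"age": "20-39", "marital": "Married", "test": "Pos"},
    {"age": "40-59", "marital": "Married", "test": "Neg"},
]
qi = ["age", "marital"]

subset = masked[:2]                 # stand-in for a GA-selected subset
released = replicate(subset, k=3)   # each chosen record now appears 3 times

print(is_k_anonymous(masked, qi, 3))    # False: each QI combination is unique
print(is_k_anonymous(released, qi, 3))  # True: holds by construction
```

Because every released record has k − 1 identical copies, the constraint holds regardless of which subset is chosen; the choice of subset affects only how useful the released data remain for prediction.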

Section snippets

Identity and confidentiality disclosure problem

A common practice for protecting against identity disclosure is to remove identity-related attributes from released data. Sweeney [33] demonstrated that this is not adequate for protecting personal identities. In fact, the author showed that 87% of the population in the United States can be uniquely identified using three demographic attributes: gender, date of birth, and 5-digit zip code. These attributes are normally not considered identity attributes. However, since they can potentially be used to

The data reconstruction approach

This study deals with the privacy protection problem in the context of predictive data mining. We focus our approach on classification analysis, which is a common data-mining task. The basic idea of our approach also applies to other predictive data-mining tasks such as numerical prediction (regression). We do not, however, target unsupervised learning problems such as clustering and association rule mining (see [2], [13] for example studies in these areas). The objective of our approach is to

An illustrative example

In this section, we demonstrate our approach using the example data in Table 1(a). As mentioned earlier, Test Result is the confidential attribute in this dataset. Age and Marital Status are the QI attributes, and Blood Pressure and Blood Type are non-QI attributes. In k-anonymity, the QI attributes are masked while the other attributes are unchanged. To illustrate, let k = 3.

    • Step 1. Aggregate numeric Age values into discretized values (labeled as Age2). As described in Section 3.1, the attribute
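The two masking operations named in this section, aggregation for numeric QI attributes and swapping for nominal ones, might look roughly as follows. This is a sketch under assumptions: the cut points, the swap rate, the sample values, and the function names `aggregate_numeric` and `swap_nominal` are hypothetical, and the discretization rule of Section 3.1 is not reproduced in this excerpt, so fixed cut points stand in for it.

```python
import random
from bisect import bisect_right

def aggregate_numeric(values, cut_points):
    """Replace each number with the label of the interval it falls into."""
    edges = [float("-inf")] + list(cut_points) + [float("inf")]
    labels = []
    for v in values:
        i = bisect_right(cut_points, v)  # index of the interval containing v
        labels.append(f"[{edges[i]}, {edges[i + 1]})")
    return labels

def swap_nominal(values, swap_rate, rng):
    """Permute the values held by a random fraction of the records."""
    out = list(values)
    chosen = rng.sample(range(len(out)), int(len(out) * swap_rate))
    targets = chosen[:]
    rng.shuffle(targets)
    for src, dst in zip(chosen, targets):
        out[dst] = values[src]  # record dst receives the value of record src
    return out

# Hypothetical values; Table 1(a) is not available in this excerpt.
ages = [25, 30, 47, 62]
marital = ["Single", "Married", "Married", "Divorced"]

age2 = aggregate_numeric(ages, cut_points=[30, 50])
marital2 = swap_nominal(marital, swap_rate=0.5, rng=random.Random(0))

print(age2)
print(marital2)
```

Swapping as written only permutes existing values, so the overall frequency of each Marital Status category is unchanged while the link between a value and a specific record is weakened.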

Computational experiments and results

A set of numerical experiments was conducted using two real-world datasets. Both datasets were taken from the Machine Learning Repository of the University of California at Irvine [17]. The characteristics of these two datasets are described in Table 7.

The first dataset, Diabetes, contains 768 instances of patient information, with nine numeric and nominal attributes, including diagnostic result, age, number of times pregnant, and a few lab test measures. Diagnostic result was considered as the

Conclusion and discussion

This paper presents a novel instance selection method based on a genetic algorithm for identity disclosure protection. We introduce a data reconstruction approach to achieve k-anonymity protection in privacy-preserving data mining. The empirical evaluation results indicate that our proposed approach can lead to significantly improved performance. The insights gained from this study can help businesses make effective decisions on privacy protection in data mining.

Our work illustrates the usefulness

Acknowledgement

This research is partially supported by funds from the Information Infrastructure Institute (iCube), Center for Information Protection Center, and College of Business at Iowa State University. We would like to thank the editor and three anonymous reviewers for their detailed comments that helped improve the paper.

References (37)

  • D.S. Chowdhury et al. Disclosure detection in multivariate categorical databases: auditing confidentiality protection through two new matrix operators. Management Science (1999)
  • C. Clifton et al. Tools for privacy preserving distributed data mining. SIGKDD Explorations (2002)
  • M. Culnan. How did they get my name?: an exploratory investigation of consumer attitudes toward secondary information use. MIS Quarterly (1993)
  • L. Ding et al. Some theoretical results about the computation time of evolutionary algorithms
  • A. Evfimievski et al. Privacy preserving mining of association rules
  • U.M. Fayyad et al. On the handling of continuous-valued attributes in decision tree generation. Machine Learning (1992)
  • A. Friedman et al. Providing k-anonymity in data mining. International Journal on Very Large Data Bases (2008)
  • R. Garfinkel et al. Privacy protection of binary confidential data against deterministic, stochastic, and insider threat. Management Science (2002)

Dan Zhu is an associate professor in the Department of Logistics, Operations and Management Information Systems at Iowa State University. She obtained her Ph.D. degree from Carnegie Mellon University. Her current research focuses on developing and applying intelligent and learning technologies to business and management problems in decision support systems, information security and privacy, and business intelligence. Dr. Zhu’s research has been published in Decision Support Systems, Proceedings of National Academy of Sciences, Information Systems Research, Naval Research Logistics, Annals of Statistics, Annals of Operations Research, Decision Sciences, Omega, Journal of Databases, Journal of Electronic Commerce Research, International Journal of Knowledge Management, Journal of Information and Software Technology, etc.

Xiao-Bai (Bob) Li is an associate professor of management information systems in the Department of Operations and Manufacturing/Management Information Systems at the University of Massachusetts Lowell. His research interests include data mining, information privacy, information economics, databases, and privacy and security issues. His work has appeared or is forthcoming in Decision Support Systems, Information Systems Research, Operations Research, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Systems, Man, and Cybernetics, IEEE Transactions on Automatic Control, Communications of the ACM, INFORMS Journal on Computing, and the European Journal of Operational Research, among others.

Shuning Wu is a Senior Statistical Analyst at ISO Innovative Analytics. He holds a PhD degree in Industrial Engineering from Iowa State University. Dr. Wu’s research is focused on data mining, instance selection and metaheuristic optimization. He is now working on applying advanced predictive modeling and data mining techniques to the insurance industry. He has published in European Journal of Operational Research, Journal of Decision Support System, International Conference on Information Systems, Journal of Tsinghua University and so on.

A preliminary version of this work was presented at the 28th International Conference on Information Systems (ICIS), Montreal, Canada, December 2007.
