Identity disclosure protection: A data reconstruction approach for privacy-preserving data mining☆
Introduction
Data-mining technologies have enabled organizations to extract useful knowledge from data in order to better understand and serve their customers, and to gain competitive advantages [6], [21], [26]. While successful business applications of data mining are encouraging, there are increasing concerns about invasions of the privacy of personal information. A survey by Time/CNN [16] revealed that 93% of respondents believed companies selling personal data should be required to gain permission from the individuals whose information is being shared. In another study [9], more than 70% of participants responded negatively to questions related to the secondary use of private information. Concern about privacy threats has caused data quality and integrity to deteriorate. According to [34], 82% of online users have refused to give personal information and 34% have lied when asked about their personal habits and preferences.
This study deals with the conflict between privacy and data mining in organizational decision support. Organizations that use their customers' records in data-mining activities are obligated to take actions to protect the identities of the individuals involved. It has been demonstrated that personal identities cannot be adequately protected by simply removing identity attributes from released data. There has been extensive research in the area of statistical databases (SDBs) on how to protect individuals' sensitive data when providing summary statistical information. The privacy issue arises in SDBs when summary statistics are derived on very few individuals' data. In this case, releasing the summary statistics may result in disclosing confidential data. The methods for preventing such disclosure can be broadly classified into two categories: (i) query restriction, which prohibits queries that would reveal confidential data, and (ii) data perturbation, which alters individual data in a way such that the summary statistics remain approximately the same. In general, both methods have been extensively investigated and employed [1]. Problems in data mining are somewhat different from those in SDBs. A data-mining task, such as classification or numeric prediction, requires working on individual records contained in a dataset. As a result, query restriction is no longer applicable and data perturbation or anonymization becomes the primary approach for privacy protection in data mining. Further, predictive data mining essentially relies on discovering relationships between data attributes. Preserving such relationships may not be consistent with preserving summary statistics. Researchers in the data-mining community have proposed various methods to resolve the conflict between data mining and privacy protection [4], [7], [14], [22], [23]. For example, a method for building a decision tree classifier from perturbed data is proposed in [3]. 
A framework for mining association rules from transaction data that have been randomized is presented in [11]. A set of algorithms for hiding sensitive rules is proposed in [36]. Techniques for preserving privacy in distributed data mining are discussed in [8].
A well-known method for privacy protection, called k-anonymity, was proposed in [31], [33]. The basic idea is to anonymize the data such that each individual cannot be distinguished from a group of at least k − 1 other individuals in the data. The method has gained increasing popularity in privacy-preserving data mining. However, the k-anonymity approach would, in some circumstances, still allow a data intruder to disclose individuals' confidential information in the k-anonymized data. To overcome this problem, we propose a data reconstruction approach to achieve k-anonymity protection in predictive data mining. In this approach, the potentially identifying attributes are first masked using aggregation (for numeric data) and swapping (for nominal data), without considering the k-anonymity constraint. A genetic algorithm technique is then applied to the masked data to find a good subset of it. This subset is then replicated to form the released dataset that satisfies the k-anonymity constraint. An experimental study is conducted to show the effectiveness of the proposed method.
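The k-anonymity constraint and the replication step can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the attribute names and record values are hypothetical, the subset is picked by hand rather than by a genetic algorithm, and replication is assumed to mean copying each selected record k times.

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """Return True if every combination of quasi-identifier (QI) values
    that occurs in `records` occurs in at least k records."""
    counts = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(c >= k for c in counts.values())

def replicate(subset, k):
    """Copy each selected record k times (an assumed replication scheme),
    so that every QI combination appears at least k times."""
    return [dict(r) for r in subset for _ in range(k)]

# Hypothetical masked records: Age already aggregated into ranges.
masked = [
    {"age_group": "30-39", "marital": "Married", "test": "Neg"},
    {"age_group": "30-39", "marital": "Single", "test": "Pos"},
    {"age_group": "40-49", "marital": "Married", "test": "Neg"},
]
qi = ["age_group", "marital"]

# Stand-in for the genetic algorithm's choice: a hand-picked subset.
subset = masked[:2]
released = replicate(subset, k=3)

print(is_k_anonymous(masked, qi, 3))    # masked data alone
print(is_k_anonymous(released, qi, 3))  # replicated subset
```

In this sketch the masked data by itself fails the check because each QI combination is unique, while the replicated subset passes by construction. Which subset preserves the most predictive value is the question the genetic algorithm is meant to answer, and that part is not shown here.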
Section snippets
Identity and confidentiality disclosure problem
A common practice for protecting against identity disclosure is to remove identity-related attributes from released data. Sweeney [33] demonstrated that this is not adequate for protecting personal identities. In fact, the author showed that 87% of the population in the United States can be uniquely identified using three demographic attributes: gender, date of birth, and 5-digit zip code. These attributes are normally not considered identity attributes. However, since they can potentially be used to
The data reconstruction approach
This study deals with the privacy protection problem in the context of predictive data mining. We focus our approach on classification analysis, which is a common data-mining task. The basic idea of our approach also applies to other predictive data-mining tasks such as numerical prediction (regression). We do not, however, target unsupervised learning problems such as clustering and association rule mining (see [2], [13] for example studies in these areas). The objective of our approach is to
An illustrative example
In this section, we demonstrate our approach using the example data in Table 1(a). As mentioned earlier, Test Result is the confidential attribute in this dataset. Age and Marital Status are the QI attributes, and Blood Pressure and Blood Type are non-QI attributes. In k-anonymity, the QI attributes are masked while the other attributes are unchanged. To illustrate, let k = 3.
Step 1. Aggregate numeric Age values into discretized values (labeled as Age2). As described in Section 3.1, the attribute
Computational experiments and results
A set of numerical experiments was conducted using two real-world datasets. Both datasets were taken from the Machine Learning Repository of the University of California at Irvine [17]. The characteristics of these two datasets are described in Table 7.
The first dataset, Diabetes, contains 768 instances of patient information, with nine numeric and nominal attributes, including diagnostic result, age, number of times pregnant, and a few lab test measures. Diagnostic result was considered as the
Conclusion and discussion
This paper presents a novel instance selection method based on a genetic algorithm for identity disclosure protection. We introduce a data reconstruction approach to achieve k-anonymity protection in privacy-preserving data mining. The empirical evaluation results indicate that our proposed approach can lead to significantly improved performance. The insights gained from this study can help businesses make effective decisions on privacy protection in data mining.
Our work illustrates the usefulness
Acknowledgement
This research is partially supported by funds from the Information Infrastructure Institute (iCube), Center for Information Protection Center, and College of Business at Iowa State University. We would like to thank the editor and three anonymous reviewers for their detailed comments that helped improve the paper.
References (37)
Dare to share: protecting sensitive knowledge with data sanitization. Decision Support Systems (2007)
Intelligence and security informatics: information systems perspective. Decision Support Systems (2006)
et al. Adaptive data reduction for large-scale transaction data. European Journal of Operational Research (2008)
et al. Predicting going concern opinion with data mining. Decision Support Systems (2008)
et al. Cyberinfrastructure for homeland security: advances in information sharing, data mining, and collaboration systems. Decision Support Systems (2007)
et al. Leadership and group search in group decision support systems. Decision Support Systems (2000)
et al. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys (1989)
et al. Achieving anonymity via clustering
et al. Privacy-preserving data mining
et al. Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study. IEEE Transactions on Evolutionary Computation (2003)
Disclosure detection in multivariate categorical databases: auditing confidentiality protection through two new matrix operators. Management Science
Tools for privacy preserving distributed data mining. SIGKDD Explorations
How did they get my name?: an exploratory investigation of consumer attitudes toward secondary information use. MIS Quarterly
Some theoretical results about the computation time of evolutionary algorithms
Privacy preserving mining of association rules
On the handling of continuous-valued attributes in decision tree generation. Machine Learning
Providing k-anonymity in data mining. International Journal on Very Large Data Bases
Privacy protection of binary confidential data against deterministic, stochastic, and insider threat. Management Science
Dan Zhu is an associate professor in the Department of Logistics, Operations and Management Information Systems at the Iowa State University. She obtained her Ph.D. degree from Carnegie Mellon University. Her current research focuses on developing and applying intelligent and learning technologies to business and management that lies in decision support systems, information security and privacy, and business intelligence. Dr. Zhu’s research has been published in the Decision Support Systems, Proceedings of National Academy of Sciences, Information System Research, Naval Research Logistics, Annals of Statistics, Annals of Operations Research, Decision Sciences, Omega, Journal of Databases, Journal of Electronic Commerce Research, International Journal of Knowledge Management, Journal of Information and Software Technology, etc.
Xiao-Bai (Bob) Li is an associate professor of management information systems in the Department of Operations and Manufacturing/Management Information Systems at the University of Massachusetts Lowell. His research interests include data mining, databases, information privacy and security, and information economics. His work has appeared or is forthcoming in Decision Support Systems, Information Systems Research, Operations Research, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Systems, Man, and Cybernetics, IEEE Transactions on Automatic Control, Communications of the ACM, INFORMS Journal on Computing, and the European Journal of Operational Research, among others.
Shuning Wu is a Senior Statistical Analyst at ISO Innovative Analytics. He holds a PhD degree in Industrial Engineering from Iowa State University. Dr. Wu’s research is focused on data mining, instance selection and metaheuristic optimization. He is now working on applying advanced predictive modeling and data mining techniques to the insurance industry. He has published in European Journal of Operational Research, Journal of Decision Support System, International Conference on Information Systems, Journal of Tsinghua University, among others.
☆ A preliminary version of this work was presented at the 28th International Conference on Information Systems (ICIS), Montreal, Canada, December 2007.