Identity disclosure protection: A data reconstruction approach for privacy-preserving data mining

doi:10.1016/j.dss.2009.07.003

Decision Support Systems

Volume 48, Issue 1, December 2009, Pages 133-140

Abstract

Identity disclosure is one of the most serious privacy concerns in today's information age. A well-known method for protecting against identity disclosure is k-anonymity. A dataset provides k-anonymity protection if the information for each individual in the dataset cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the dataset. However, k-anonymity has a flaw that can still allow an intruder to discern the confidential information of individuals in the anonymized data. To overcome this problem, we propose a data reconstruction approach to achieve k-anonymity protection in predictive data mining. In this approach, the potentially identifying attributes are first masked using aggregation (for numeric data) and swapping (for nominal data). A genetic algorithm technique is then applied to the masked data to find a good subset of it. This subset is then replicated to form the released dataset that satisfies the k-anonymity constraint.
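The k-anonymity condition stated above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function name `is_k_anonymous`, the attribute names, and the toy records are all hypothetical. It groups records by their potentially identifying attribute values and verifies that every group contains at least k records.

```python
from collections import Counter

def is_k_anonymous(records, qi_attrs, k):
    """Return True if every combination of potentially identifying
    attribute values is shared by at least k records."""
    counts = Counter(tuple(r[a] for a in qi_attrs) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical toy data: "age" and "marital" are treated as potentially
# identifying attributes, "test" as the confidential attribute.
records = [
    {"age": "20-29", "marital": "Single", "test": "Neg"},
    {"age": "20-29", "marital": "Single", "test": "Pos"},
    {"age": "20-29", "marital": "Single", "test": "Neg"},
    {"age": "30-39", "marital": "Married", "test": "Pos"},
    {"age": "30-39", "marital": "Married", "test": "Pos"},
]

print(is_k_anonymous(records, ["age", "marital"], 2))  # True
print(is_k_anonymous(records, ["age", "marital"], 3))  # False: one group has only 2 records
```

The smallest group determines the result: the toy data satisfies 2-anonymity because both groups have at least two members, but fails 3-anonymity because the second group has only two.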

Introduction

Data-mining technologies have enabled organizations to extract useful knowledge from the data in order to better understand and serve their customers, and to gain competitive advantages [6], [21], [26]. While successful business applications of data mining are encouraging, there are increasing concerns about invasions of the privacy of personal information. A survey by Time/CNN [16] revealed that 93% of respondents believed companies selling personal data should be required to gain permission from the individuals whose information is being shared. In another study [9], more than 70% of participants responded negatively to questions related to the secondary use of private information. Concern about privacy threats has caused data quality and integrity to deteriorate. According to [34], 82% of online users have refused to give personal information and 34% have lied when asked about their personal habits and preferences.

This study deals with the conflict between privacy and data mining in organizational decision support. Organizations that use their customers' records in data-mining activities are obligated to take actions to protect the identities of the individuals involved. It has been demonstrated that personal identities cannot be adequately protected by simply removing identity attributes from released data. There has been extensive research in the area of statistical databases (SDBs) on how to protect individuals' sensitive data when providing summary statistical information. The privacy issue arises in SDBs when summary statistics are derived on very few individuals' data. In this case, releasing the summary statistics may result in disclosing confidential data. The methods for preventing such disclosure can be broadly classified into two categories: (i) query restriction, which prohibits queries that would reveal confidential data, and (ii) data perturbation, which alters individual data in a way such that the summary statistics remain approximately the same. In general, both methods have been extensively investigated and employed [1]. Problems in data mining are somewhat different from those in SDBs. A data-mining task, such as classification or numeric prediction, requires working on individual records contained in a dataset. As a result, query restriction is no longer applicable and data perturbation or anonymization becomes the primary approach for privacy protection in data mining. Further, predictive data mining essentially relies on discovering relationships between data attributes. Preserving such relationships may not be consistent with preserving summary statistics. Researchers in the data-mining community have proposed various methods to resolve the conflict between data mining and privacy protection [4], [7], [14], [22], [23]. For example, a method for building a decision tree classifier from perturbed data is proposed in [3]. A framework for mining association rules from transaction data that have been randomized is presented in [11]. A set of algorithms for hiding sensitive rules is proposed in [36]. Techniques for preserving privacy in distributed data mining are discussed in [8].
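The point that preserving summary statistics does not necessarily preserve relationships between attributes can be seen in a deliberately contrived sketch. The data and the perturbation below are hypothetical and are not taken from any of the cited methods: one attribute is perturbed by reversing the order of its values across records, which leaves its mean unchanged but flips its correlation with another attribute.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical records: pressure rises linearly with age.
age = [25, 35, 45, 55, 65]
pressure = [110, 120, 130, 140, 150]

# An extreme perturbation: reassign pressure values in reverse order.
perturbed = list(reversed(pressure))

print(mean(pressure), mean(perturbed))      # the summary statistic is identical
print(round(pearson(age, pressure), 6))     # strong positive relationship
print(round(pearson(age, perturbed), 6))    # relationship is reversed
```

A predictive model trained on the perturbed column would learn the opposite relationship, even though a query for the mean returns the same answer on both versions.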

A well-known method for privacy protection, called k-anonymity, was proposed in [31], [33]. The basic idea is to anonymize the data such that each individual cannot be distinguished from a group of other individuals in the data. The method has gained increasing popularity in privacy-preserving data mining. However, the k-anonymity approach would, in some circumstances, still allow a data intruder to uncover individuals' confidential information in the k-anonymized data. To overcome this problem, we propose a data reconstruction approach to achieve k-anonymity protection in predictive data mining. In this approach, the potentially identifying attributes are first masked using aggregation (for numeric data) and swapping (for nominal data), without considering the k-anonymity constraint. A genetic algorithm technique is then applied to the masked data to find a good subset of it. This subset is then replicated to form the released dataset that satisfies the k-anonymity constraint. An experimental study is conducted to show the effectiveness of the proposed method.
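The three stages described above (mask, select a subset, replicate) can be sketched as follows. This is an illustrative outline under several assumptions, not the authors' algorithm: aggregation is approximated by fixed-width intervals, swapping by a random permutation of values across records, and the genetic algorithm search is replaced by plain random sampling, since its details are not given in this excerpt. It is also assumed that replication means copying each selected record k times. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def aggregate_age(age, width=10):
    """Mask a numeric value by its interval, e.g. 34 -> '30-39' (assumed scheme)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def swap_nominal(values, rng):
    """Mask a nominal attribute by permuting its values across records (assumed scheme)."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled

def reconstruct(records, k, subset_size, rng):
    # Stage 1: mask the potentially identifying attributes.
    swapped = swap_nominal([r["marital"] for r in records], rng)
    masked = [
        {**r, "age": aggregate_age(r["age"]), "marital": m}
        for r, m in zip(records, swapped)
    ]
    # Stage 2: choose a subset. Random sampling is a stand-in here;
    # the paper uses a genetic algorithm to find a good subset.
    subset = rng.sample(masked, subset_size)
    # Stage 3: replicate each selected record k times (assumed), so every
    # combination of masked attribute values occurs at least k times.
    return [dict(r) for r in subset for _ in range(k)]

# Hypothetical toy data.
data = [
    {"age": 23, "marital": "Single", "test": "Neg"},
    {"age": 34, "marital": "Married", "test": "Pos"},
    {"age": 38, "marital": "Divorced", "test": "Neg"},
    {"age": 41, "marital": "Married", "test": "Pos"},
    {"age": 47, "marital": "Single", "test": "Neg"},
    {"age": 52, "marital": "Married", "test": "Pos"},
]

released = reconstruct(data, k=3, subset_size=2, rng=random.Random(0))
group_sizes = Counter((r["age"], r["marital"]) for r in released)
print(len(released))                                # 6
print(all(c >= 3 for c in group_sizes.values()))    # True
```

Under the copy-k-times assumption the k-anonymity constraint holds by construction, so the quality of the released data depends entirely on which subset is selected; that is the part the genetic algorithm is meant to optimize.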

Section snippets

Identity and confidentiality disclosure problem

A common practice for protecting against identity disclosure is to remove identity-related attributes from released data. Sweeney [33] demonstrated that this is not adequate for protecting personal identities. In fact, the author showed that 87% of the population in the United States can be uniquely identified using three demographic attributes: gender, date of birth, and 5-digit zip code. These attributes are normally not considered identity attributes. However, since they can potentially be used to

The data reconstruction approach

This study deals with the privacy protection problem in the context of predictive data mining. We focus our approach on classification analysis, which is a common data-mining task. The basic idea of our approach also applies to other predictive data-mining tasks such as numerical prediction (regression). We do not, however, target unsupervised learning problems such as clustering and association rule mining (see [2], [13] for example studies in these areas). The objective of our approach is to

An illustrative example

In this section, we demonstrate our approach using the example data in Table 1(a). As mentioned earlier, Test Result is the confidential attribute in this dataset. Age and Marital Status are the QI attributes, and Blood Pressure and Blood Type are non-QI attributes. In k-anonymity, the QI attributes are masked while the other attributes are unchanged. To illustrate, let k = 3.

- Step 1. Aggregate numeric Age values into discretized values (labeled as Age2). As described in Section 3.1, the attribute

Computational experiments and results

A set of numerical experiments was conducted using two real-world datasets. Both datasets were taken from the Machine Learning Repository of the University of California at Irvine [17]. The characteristics of these two datasets are described in Table 7.

The first dataset, Diabetes, contains 768 instances of patient information, with nine numeric and nominal attributes, including diagnostic result, age, number of times pregnant, and a few lab test measures. Diagnostic result was considered as the

Conclusion and discussion

This paper presents a novel instance selection method based on a genetic algorithm for identity disclosure protection. We introduce a data reconstruction approach to achieve k-anonymity protection in privacy-preserving data mining. The empirical evaluation results indicate that our proposed approach can lead to significantly improved performance. The insights gained from this study can help businesses make effective decisions on privacy protection in data mining.

Our work illustrates the usefulness

Acknowledgement

This research is partially supported by funds from the Information Infrastructure Institute (iCube), the Center for Information Protection, and the College of Business at Iowa State University. We would like to thank the editor and three anonymous reviewers for their detailed comments that helped improve the paper.

References (37)

A. Amiri. Dare to share: protecting sensitive knowledge with data sanitization. Decision Support Systems (2007)
H. Chen. Intelligence and security informatics: information systems perspective. Decision Support Systems (2006)
X.-B. Li et al. Adaptive data reduction for large-scale transaction data. European Journal of Operational Research (2008)
D. Martens et al. Predicting going concern opinion with data mining. Decision Support Systems (2008)
T.S. Raghu et al. Cyberinfrastructure for homeland security: advances in information sharing, data mining, and collaboration systems. Decision Support Systems (2007)
J. Rees et al. Leadership and group search in group decision support systems. Decision Support Systems (2000)
N.R. Adam et al. Security-control methods for statistical databases: a comparative study. ACM Computing Surveys (1989)
G. Aggarwal et al. Achieving anonymity via clustering
R. Agrawal et al. Privacy-preserving data mining
J.R. Cano et al. Using evolutionary algorithms as instance selection for data reduction in KDD: an experimental study. IEEE Transactions on Evolutionary Computation (2003)
D.S. Chowdhury et al. Disclosure detection in multivariate categorical databases: auditing confidentiality protection through two new matrix operators. Management Science (1999)
C. Clifton et al. Tools for privacy preserving distributed data mining. SIGKDD Explorations (2002)
M. Culnan. How did they get my name?: an exploratory investigation of consumer attitudes toward secondary information use. MIS Quarterly (1993)
L. Ding et al. Some theoretical results about the computation time of evolutionary algorithms
A. Evfimievski et al. Privacy preserving mining of association rules
U.M. Fayyad et al. On the handling of continuous-valued attributes in decision tree generation. Machine Learning (1992)
A. Friedman et al. Providing k-anonymity in data mining. International Journal on Very Large Data Bases (2008)
R. Garfinkel et al. Privacy protection of binary confidential data against deterministic, stochastic, and insider threat. Management Science (2002)

Dan Zhu is an associate professor in the Department of Logistics, Operations and Management Information Systems at Iowa State University. She obtained her Ph.D. degree from Carnegie Mellon University. Her current research focuses on developing and applying intelligent and learning technologies to business and management problems in decision support systems, information security and privacy, and business intelligence. Dr. Zhu’s research has been published in Decision Support Systems, Proceedings of the National Academy of Sciences, Information Systems Research, Naval Research Logistics, Annals of Statistics, Annals of Operations Research, Decision Sciences, Omega, Journal of Databases, Journal of Electronic Commerce Research, International Journal of Knowledge Management, and Journal of Information and Software Technology, among others.

Xiao-Bai (Bob) Li is an associate professor of management information systems in the Department of Operations and Manufacturing/Management Information Systems at the University of Massachusetts Lowell. His research interests include data mining, information privacy, information economics, databases, and privacy and security issues. His work has appeared or is forthcoming in Decision Support Systems, Information Systems Research, Operations Research, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Systems, Man, and Cybernetics, IEEE Transactions on Automatic Control, Communications of the ACM, INFORMS Journal on Computing, and the European Journal of Operational Research, among others.

Shuning Wu is a Senior Statistical Analyst at ISO Innovative Analytics. He holds a PhD degree in Industrial Engineering from Iowa State University. Dr. Wu’s research is focused on data mining, instance selection and metaheuristic optimization. He is now working on applying advanced predictive modeling and data mining techniques to the insurance industry. He has published in European Journal of Operational Research, Journal of Decision Support System, International Conference on Information Systems, and Journal of Tsinghua University, among others.

^☆: A preliminary version of this work was presented at the 28th International Conference on Information Systems (ICIS), Montreal, Canada, December 2007.
