Privacy preserving data mining: A noise addition framework using a novel clustering technique
Introduction
During the whole process of data mining (from data collection to knowledge discovery), sensitive information contained in data sets may be exposed to several parties. Disclosure of such sensitive information is considered a breach of individual privacy. Consequently, there is widespread public awareness of and concern about privacy, as many recent surveys illustrate [1]. This awareness, together with a lack of public trust in organizations, may introduce additional complexity to data collection. As a result, organizations may not be able to fully enjoy the benefits of data mining.
Therefore, many privacy preserving data mining techniques have been proposed. For example, one group of techniques produces synthetic data from an original data set and releases, in place of the original, a synthetic data set that maintains some characteristics of the original [2], [3], [4]. Another group of techniques adds noise to a data set in order to preserve individual privacy in the perturbed data set while still allowing a data miner to produce a high quality decision tree from it [5], [6], [7]. Many techniques have also been proposed for privacy preservation in association rule mining [8], [9], [10], [11], [12], [13], [14], [15], [16], [17].
In this paper we present a framework for adding noise to all attributes (both numerical and categorical) in two steps. In the first step, following a data swapping technique [6], [18], we add noise to sensitive class attribute values, which are also known as labels. In the second step, we add noise to all non-class attributes to prevent a record from being re-identified with high certainty and a sensitive class value from being disclosed. Noise addition to non-class attributes also protects the attributes themselves from disclosure. The main goal of our noise addition techniques is to provide a high level of security while preserving good data quality.
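The two-step idea can be illustrated with a minimal sketch. The records, attribute names, noise magnitudes, and the specific swap/replacement rules below are all illustrative assumptions, not the paper's actual techniques (which are defined in later sections): step 1 swaps the class labels of a randomly chosen pair of records, as a simple stand-in for the data swapping of [6], [18], and step 2 perturbs every non-class attribute, numerically via additive Gaussian noise and categorically via occasional random replacement.

```python
import random

# Hypothetical records: each a dict of non-class attributes plus a class label.
records = [
    {"age": 34, "city": "A", "label": "yes"},
    {"age": 51, "city": "B", "label": "no"},
    {"age": 29, "city": "A", "label": "yes"},
    {"age": 45, "city": "C", "label": "no"},
]

random.seed(0)

# Step 1: swap the class labels of a randomly chosen pair of records
# (a simplified stand-in for the data swapping technique of [6], [18]).
i, j = random.sample(range(len(records)), 2)
records[i]["label"], records[j]["label"] = records[j]["label"], records[i]["label"]

# Step 2: add noise to every non-class attribute.
cities = ["A", "B", "C"]
for r in records:
    r["age"] += random.gauss(0, 2)      # numerical: additive Gaussian noise
    if random.random() < 0.2:           # categorical: occasional random replacement
        r["city"] = random.choice(cities)
```

Note that swapping (rather than independently flipping) labels leaves the overall class distribution of the data set unchanged, which is one reason swapping-based perturbation tends to preserve aggregate patterns.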
The organization of the paper is as follows. Section 2 presents DETECTIVE, a novel technique for clustering categorical values. In Section 3, a few novel techniques are presented for noise addition to numerical and categorical attributes of a data set. Section 4 presents our framework that combines the noise addition techniques in order to perturb all attributes of a data set. We also present an Extended Framework that incorporates an existing technique called GADP [3] or one of its variants such as CGADP [19] or EGADP [20]. Experimental results are presented in Section 5. In Section 6 we present a security analysis. Section 7 gives concluding remarks.
Section snippets
Detective: a novel categorical values clustering technique
Various relationships between the classifier (non-class) attributes and the class attribute of a data set are discovered in the form of logic rules (patterns) by a decision tree. Fig. 1 is an example of a decision tree. The tree has four leaves. Leaf 1, Leaf 2 and Leaf 4 are heterogeneous leaves, since the records belonging to each of these leaves have different class values. We refer to the path from the root to a leaf as
Class attribute perturbation techniques
We use the following notation to explain our class attribute perturbation techniques [24], [25]. Let H be the number of heterogeneous leaves, mk be the number of majority records in the kth heterogeneous leaf where 1 ⩽ k ⩽ H, nk be the number of minority records (i.e. the records whose class values differ from the majority class value) in the kth heterogeneous leaf where 1 ⩽ k ⩽ H, and E(N) be the expected number of changed class values.
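The quantities H, mk and nk can be computed directly from the leaves of a decision tree. Below is a minimal sketch using hypothetical leaves, each represented simply as the list of class values of the records reaching it; how E(N) follows from these counts depends on the particular perturbation technique and is not reproduced here.

```python
from collections import Counter

# Hypothetical heterogeneous leaves: each leaf is the list of class values
# of the records that reach it.
leaves = [
    ["yes", "yes", "yes", "no"],        # leaf 1: m_1 = 3, n_1 = 1
    ["no", "no", "yes", "yes", "no"],   # leaf 2: m_2 = 3, n_2 = 2
]

H = len(leaves)  # number of heterogeneous leaves
m = []           # majority counts m_k
n = []           # minority counts n_k
for leaf in leaves:
    counts = Counter(leaf)
    majority = counts.most_common(1)[0][1]  # size of the largest class
    m.append(majority)
    n.append(len(leaf) - majority)          # everything else is a minority record
```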
In Random Perturbation Technique (RPT), the class values of
The Framework
We now present a high level pseudocode of our framework [26] for adding noise to all attributes. A decision tree is first built from an original data set, and then used to perturb the data set as follows.
For each leaf, DO:
- Step 1:
Add noise to the numerical Leaf Influential Attributes (LINFAs) of the original records belonging to the leaf using the Leaf Influential Attribute Perturbation Technique (LINFAPT), thereby producing a set of perturbed records ps1.
- Step 2:
Add noise to numerical Leaf Innocent Attributes (LINNAs) of
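The per-leaf structure of the steps above can be sketched as follows. This is only an illustrative skeleton under stated assumptions: the records, attribute names, the partition into LINFAs and LINNAs, and the use of zero-mean Gaussian noise with per-step magnitudes are all hypothetical, not the actual LINFAPT definition, which the framework specifies in full.

```python
import random

random.seed(1)

# Hypothetical data: records grouped by the decision-tree leaf they fall
# into, with the numerical attributes split into those tested on the path
# to the leaf (LINFAs) and the remaining ones (LINNAs).
leaf_records = {
    "leaf1": [{"income": 50.0, "height": 170.0}],
    "leaf2": [{"income": 80.0, "height": 160.0}],
}
linfas = ["income"]   # leaf influential attributes (assumed)
linnas = ["height"]   # leaf innocent attributes (assumed)

def add_noise(record, attributes, sigma):
    """Perturb the named numerical attributes with zero-mean Gaussian noise."""
    for a in attributes:
        record[a] += random.gauss(0, sigma)

for leaf, records in leaf_records.items():
    for r in records:
        add_noise(r, linfas, sigma=1.0)   # Step 1: perturb LINFAs
        add_noise(r, linnas, sigma=5.0)   # Step 2: perturb LINNAs
```

A smaller noise magnitude on influential attributes is a plausible design choice, since large perturbations there could move a record across a split threshold and change the leaf (and hence the pattern) it belongs to.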
Experimental results
In this section we first present experimental results on DETECTIVE and CAPT on a synthetic data set. We then present experimental results on our Framework and Extended Framework using two real data sets. We compare our frameworks with GADP [3] and random noise addition approaches.
We now introduce a few terms that we use throughout this section. The decision tree obtained from an original training data set is called original tree. The rules belonging to the original tree are called original
Security analysis
Due to the varying definitions of disclosure, it is not trivial to measure disclosure risk. Moreover, disclosure risk depends on various other factors, such as an intruder's supplementary knowledge and the approach the intruder takes. Nevertheless, measuring disclosure risk effectively is important, since the effectiveness of a data perturbation technique is evaluated by both the disclosure risk and the data quality of a perturbed data set.
Before we introduce our approach to measuring the disclosure
Conclusion
In this paper we have introduced a framework that incorporates several novel techniques to perturb all attributes of a data set. Our experimental results indicate that the framework is very effective in preserving the original patterns in a perturbed data set. The trees obtained from data sets perturbed by the framework are very similar to the original tree. For the Adult data set, four out of five perturbed trees have 100% Type A or Type B rules. The fifth tree has 70% Type A or Type B rules and 0%
Acknowledgement
This work was supported by ARC Grant No. DG-DP0452182 and a Seed Grant from the Faculty of Business, Charles Sturt University, Australia.
References (30)
- W.C.G.P. Ltd., Community Attitude towards Privacy 2007, A Survey Prepared for the Office of the Federal Privacy...
- Y. Zhu, L. Liu, Optimal randomization for privacy preserving data mining, in: Proceedings of the Tenth ACM SIGKDD...
- et al., A general additive data perturbation method for database security, Management Science (1999)
- et al., Perturbing non-normal confidential attributes: the copula approach, Management Science (2002)
- et al., Privacy-preserving data mining
- V. Estivill-Castro, L. Brankovic, Data swapping: balancing privacy against precision in mining for logic rules, in:...
- W. Du, Z. Zhan, Using randomized response techniques for privacy-preserving data mining, in: Proceedings of the Ninth...
- S.R.M. Oliveira, O.R. Zaïane, Algorithms for balancing privacy and knowledge discovery in association rule mining, in:...
- et al., Association rule hiding, IEEE Trans. Knowl. Data Eng. (2004)
- S. Rizvi, J.R. Haritsa, Maintaining data privacy in association rule mining, in: Proceedings of the 28th VLDB...