Privacy preserving data mining: A noise addition framework using a novel clustering technique
Introduction
During the whole process of data mining (from data collection to knowledge discovery), sensitive information contained in data sets may be exposed to several parties. Disclosure of such sensitive information is considered a breach of individual privacy. Consequently, there is widespread public awareness of and concern about privacy, as many recent surveys illustrate [1]. This awareness, together with a lack of public trust in organizations, may introduce additional complexity to data collection. As a result, organizations may not be able to fully enjoy the benefits of data mining.
Therefore, many privacy preserving data mining techniques have been proposed. For example, one group of techniques produces synthetic data from an original data set and releases, in place of the original, a synthetic data set that maintains some characteristics of the original [2], [3], [4]. Another group of techniques adds noise to a data set in order to preserve individual privacy in the perturbed data set while still allowing a data miner to produce a high quality decision tree from it [5], [6], [7]. Many techniques have also been proposed for privacy preservation in association rule mining [8], [9], [10], [11], [12], [13], [14], [15], [16], [17].
In this paper we present a framework for adding noise to all attributes (both numerical and categorical) in two steps. In the first step, following a data swapping technique [6], [18], we add noise to sensitive class attribute values, which are also known as labels. In the second step, we add noise to all non-class attributes to prevent a record from being re-identified with high certainty and a sensitive class value from being disclosed. Noise addition to non-class attributes also protects the attributes themselves from disclosure. The main goal of our noise addition techniques is to provide a high level of security while preserving good data quality.
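The two-step idea can be illustrated with a minimal sketch. The records, attribute names, noise magnitudes, and the specific swap/replacement rules below are all illustrative assumptions, not the paper's actual techniques (which are defined in later sections): step 1 swaps the class labels of a randomly chosen pair of records, as a simple stand-in for the data swapping of [6], [18], and step 2 perturbs every non-class attribute, numerically via additive Gaussian noise and categorically via occasional random replacement.

```python
import random

# Hypothetical records: each a dict of non-class attributes plus a class label.
records = [
    {"age": 34, "city": "A", "label": "yes"},
    {"age": 51, "city": "B", "label": "no"},
    {"age": 29, "city": "A", "label": "yes"},
    {"age": 45, "city": "C", "label": "no"},
]

random.seed(0)

# Step 1: swap the class labels of a randomly chosen pair of records
# (a simplified stand-in for the data swapping technique of [6], [18]).
i, j = random.sample(range(len(records)), 2)
records[i]["label"], records[j]["label"] = records[j]["label"], records[i]["label"]

# Step 2: add noise to every non-class attribute.
cities = ["A", "B", "C"]
for r in records:
    r["age"] += random.gauss(0, 2)      # numerical: additive Gaussian noise
    if random.random() < 0.2:           # categorical: occasional random replacement
        r["city"] = random.choice(cities)
```

Note that swapping (rather than independently flipping) labels leaves the overall class distribution of the data set unchanged, which is one reason swapping-based perturbation tends to preserve aggregate patterns.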
The organization of the paper is as follows. Section 2 presents DETECTIVE, a novel technique for clustering categorical values. In Section 3, a few novel techniques are presented for noise addition to numerical and categorical attributes of a data set. Section 4 presents our framework that combines the noise addition techniques in order to perturb all attributes of a data set. We also present an Extended Framework that incorporates an existing technique called GADP [3] or one of its variants such as CGADP [19] or EGADP [20]. Experimental results are presented in Section 5. In Section 6 we present a security analysis. Section 7 gives concluding remarks.
Section snippets
Detective: a novel categorical values clustering technique
Various relationships between the classifier (non-class) attributes and the class attribute of a data set are discovered in the form of logic rules (patterns) by a decision tree. Fig. 1 is an example of a decision tree. The tree has four leaves. Leaf 1, Leaf 2 and Leaf 4 are heterogeneous leaves, since the records belonging to each of these leaves have different class values. We refer to the path from the root to a leaf as
Class attribute perturbation techniques
We use the following notation to explain our class attribute perturbation techniques [24], [25]. Let H be the number of heterogeneous leaves, mk be the number of majority records in the kth heterogeneous leaf where 1 ⩽ k ⩽ H, nk be the number of minority records (i.e. the records whose class values differ from the majority class value) in the kth heterogeneous leaf where 1 ⩽ k ⩽ H, and E(N) be the expected number of changed class values.
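The quantities H, mk and nk can be computed directly from the leaves of a decision tree. Below is a minimal sketch using hypothetical leaves, each represented simply as the list of class values of the records reaching it; how E(N) follows from these counts depends on the particular perturbation technique and is not reproduced here.

```python
from collections import Counter

# Hypothetical heterogeneous leaves: each leaf is the list of class values
# of the records that reach it.
leaves = [
    ["yes", "yes", "yes", "no"],        # leaf 1: m_1 = 3, n_1 = 1
    ["no", "no", "yes", "yes", "no"],   # leaf 2: m_2 = 3, n_2 = 2
]

H = len(leaves)  # number of heterogeneous leaves
m = []           # majority counts m_k
n = []           # minority counts n_k
for leaf in leaves:
    counts = Counter(leaf)
    majority = counts.most_common(1)[0][1]  # size of the largest class
    m.append(majority)
    n.append(len(leaf) - majority)          # everything else is a minority record
```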
In Random Perturbation Technique (RPT), the class values of
The Framework
We now present a high level pseudocode of our framework [26] for adding noise to all attributes. A decision tree is first built from an original data set, and then used to perturb the data set as follows.
For each leaf, DO:
- Step 1:
Add noise to the numerical Leaf Influential Attributes (LINFAs) of the original records belonging to the leaf using the Leaf Influential Attribute Perturbation Technique (LINFAPT), thereby producing a set of perturbed records ps1.
- Step 2:
Add noise to numerical Leaf Innocent Attributes (LINNAs) of
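The per-leaf structure of the steps above can be sketched as follows. This is only an illustrative skeleton under stated assumptions: the records, attribute names, the partition into LINFAs and LINNAs, and the use of zero-mean Gaussian noise with per-step magnitudes are all hypothetical, not the actual LINFAPT definition, which the framework specifies in full.

```python
import random

random.seed(1)

# Hypothetical data: records grouped by the decision-tree leaf they fall
# into, with the numerical attributes split into those tested on the path
# to the leaf (LINFAs) and the remaining ones (LINNAs).
leaf_records = {
    "leaf1": [{"income": 50.0, "height": 170.0}],
    "leaf2": [{"income": 80.0, "height": 160.0}],
}
linfas = ["income"]   # leaf influential attributes (assumed)
linnas = ["height"]   # leaf innocent attributes (assumed)

def add_noise(record, attributes, sigma):
    """Perturb the named numerical attributes with zero-mean Gaussian noise."""
    for a in attributes:
        record[a] += random.gauss(0, sigma)

for leaf, records in leaf_records.items():
    for r in records:
        add_noise(r, linfas, sigma=1.0)   # Step 1: perturb LINFAs
        add_noise(r, linnas, sigma=5.0)   # Step 2: perturb LINNAs
```

A smaller noise magnitude on influential attributes is a plausible design choice, since large perturbations there could move a record across a split threshold and change the leaf (and hence the pattern) it belongs to.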
Experimental results
In this section we first present experimental results on DETECTIVE and CAPT on a synthetic data set. We then present experimental results on our Framework and Extended Framework using two real data sets. We compare our frameworks with GADP [3] and random noise addition approaches.
We now introduce a few terms that we use throughout this section. The decision tree obtained from an original training data set is called original tree. The rules belonging to the original tree are called original
Security analysis
Due to the varying definitions of disclosure, it is not trivial to measure disclosure risk. Moreover, disclosure risk depends on various other factors, such as an intruder's supplementary knowledge and the approach the intruder takes. Nevertheless, measuring disclosure risk effectively is important, since the effectiveness of a data perturbation technique is evaluated by both the disclosure risk and the data quality of a perturbed data set.
Before we introduce our approach to measuring the disclosure
Conclusion
In this paper we have introduced a framework that incorporates several novel techniques to perturb all attributes of a data set. Our experimental results indicate that the framework is very effective in preserving the original patterns in a perturbed data set. The trees obtained from data sets perturbed by the framework are very similar to the original tree. For the Adult data set, four out of five perturbed trees have 100% Type A or Type B rules. The fifth tree has 70% Type A or Type B rules and 0%
Acknowledgement
This work was supported by ARC Grant No. DG-DP0452182 and a Seed Grant from the Faculty of Business, Charles Sturt University, Australia.
References (30)
- W.C.G.P. Ltd., Community Attitude towards Privacy 2007, A Survey Prepared for the Office of the Federal Privacy...
- Y. Zhu, L. Liu, Optimal randomization for privacy preserving data mining, in: Proceedings of the Tenth ACM SIGKDD...
- et al., A general additive data perturbation method for database security, Management Science (1999)
- et al., Perturbing non-normal confidential attributes: the copula approach, Management Science (2002)
- et al., Privacy-preserving data mining
- V. Estivill-Castro, L. Brankovic, Data swapping: balancing privacy against precision in mining for logic rules, in:...
- W. Du, Z. Zhan, Using randomized response techniques for privacy-preserving data mining, in: Proceedings of the Ninth...
- S.R.M. Oliveira, O.R. Zaïane, Algorithms for balancing privacy and knowledge discovery in association rule mining, in:...
- et al., Association rule hiding, IEEE Trans. Knowl. Data Eng. (2004)
- S. Rizvi, J.R. Haritsa, Maintaining data privacy in association rule mining, in: Proceedings of the 28th VLDB...