Random-data perturbation techniques and privacy-preserving data mining

Kargupta, Hillol; Datta, Souptik; Wang, Qi; Sivakumar, Krishnamoorthy

doi:10.1007/s10115-004-0173-6

Published: 01 May 2005

Knowledge and Information Systems, Volume 7, pages 387–414 (2005)

Abstract

Privacy is becoming an increasingly important issue in many data-mining applications. This has triggered the development of many privacy-preserving data-mining techniques. A large fraction of them use randomized data-distortion techniques to mask the data and so preserve the privacy of sensitive values. This methodology attempts to hide the sensitive data by randomly modifying the data values, often using additive noise. This paper questions the utility of the random-value distortion technique in privacy preservation. The paper first notes that random matrices have predictable structures in the spectral domain, and then develops a random matrix-based spectral-filtering technique to retrieve original data from a dataset distorted by adding random values. The proposed method works by comparing the spectrum generated from the observed data with that of random matrices. This paper presents the theoretical foundation and extensive experimental results to demonstrate that, in many cases, random-data distortion preserves very little data privacy. The analytical framework presented in this paper also points out several possible avenues for the development of new privacy-preserving data-mining techniques. Examples include algorithms that explicitly guard against privacy breaches through linear transformations, exploiting multiplicative and colored noise for preserving privacy in data-mining applications.
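The spectral-filtering idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm as published; it assumes i.i.d. zero-mean Gaussian additive noise whose variance is known to the attacker, synthetic low-rank data, and a simplified rule that treats every covariance eigenvalue above the Marchenko-Pastur upper edge as signal. The function names `mp_upper_edge` and `spectral_filter` and all parameter values are hypothetical choices made for this illustration.

```python
import numpy as np


def mp_upper_edge(noise_var, n_records, n_features):
    """Upper edge of the Marchenko-Pastur support for the eigenvalues of
    the sample covariance of an i.i.d. noise matrix (n_features <= n_records)."""
    q = n_features / n_records
    return noise_var * (1.0 + np.sqrt(q)) ** 2


def spectral_filter(perturbed, noise_var):
    """Estimate the original data by projecting the perturbed data onto the
    eigenvectors whose eigenvalues exceed what pure noise would produce."""
    n_records, n_features = perturbed.shape
    mean = perturbed.mean(axis=0)
    centered = perturbed - mean
    cov = centered.T @ centered / n_records
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    edge = mp_upper_edge(noise_var, n_records, n_features)
    signal_vecs = eigvecs[:, eigvals > edge]  # columns kept as "signal"
    estimate = centered @ signal_vecs @ signal_vecs.T + mean
    return estimate, signal_vecs.shape[1]


# Synthetic example (assumption): correlated data with 3 latent factors.
rng = np.random.default_rng(0)
n_records, n_features, n_factors = 5000, 20, 3
latent = rng.normal(size=(n_records, n_factors))
mixing = rng.normal(size=(n_factors, n_features))
original = latent @ mixing

# Random-value distortion: additive i.i.d. Gaussian noise.
noise_var = 1.0
noise = rng.normal(scale=np.sqrt(noise_var), size=original.shape)
perturbed = original + noise

estimate, n_signal = spectral_filter(perturbed, noise_var)
err_before = np.mean((perturbed - original) ** 2)
err_after = np.mean((estimate - original) ** 2)
print(f"signal eigenvectors kept: {n_signal}")
print(f"mean squared error, perturbed vs original: {err_before:.3f}")
print(f"mean squared error, filtered vs original:  {err_after:.3f}")
```

In this sketch the filtered estimate lands closer to the original values than the perturbed data because the correlated attributes concentrate their variance in a few eigen-directions, while the noise spreads evenly over all of them. Data with weak correlations between attributes would leave little for the filter to exploit.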

Author information

Authors and Affiliations

Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, MD, 21250, USA
Hillol Kargupta & Souptik Datta
School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, USA
Qi Wang & Krishnamoorthy Sivakumar

Corresponding author

Correspondence to Hillol Kargupta.

About this article

Cite this article

Kargupta, H., Datta, S., Wang, Q. et al. Random-data perturbation techniques and privacy-preserving data mining. Knowl Inf Syst 7, 387–414 (2005). https://doi.org/10.1007/s10115-004-0173-6

Received: 19 November 2003
Revised: 05 January 2004
Accepted: 16 February 2004
Published: 01 May 2005
Issue Date: May 2005
DOI: https://doi.org/10.1007/s10115-004-0173-6
