Random-data perturbation techniques and privacy-preserving data mining

Knowledge and Information Systems

Abstract

Privacy is becoming an increasingly important issue in many data-mining applications. This has triggered the development of many privacy-preserving data-mining techniques. A large fraction of them use randomized data-distortion techniques to mask sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values, often using additive noise. This paper questions the utility of the random-value distortion technique in privacy preservation. The paper first notes that random matrices have predictable structures in the spectral domain, and then it develops a random matrix-based spectral-filtering technique to retrieve original data from a dataset distorted by adding random values. The proposed method works by comparing the spectrum generated from the observed data with that of random matrices. This paper presents the theoretical foundation and extensive experimental results to demonstrate that, in many cases, random-data distortion preserves very little data privacy. The analytical framework presented in this paper also points out several possible avenues for the development of new privacy-preserving data-mining techniques. Examples include algorithms that explicitly guard against privacy breaches through linear transformations, and algorithms that exploit multiplicative and colored noise to preserve privacy in data-mining applications.
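The spectral-filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions that are not stated in the abstract: the noise is i.i.d. Gaussian with a known variance, the original data is approximately low-rank, and the cutoff between "noise" and "signal" eigenvalues is the Marchenko-Pastur upper edge for a sample covariance matrix. The names `spectral_filter` and `demo` are hypothetical.

```python
import numpy as np


def spectral_filter(perturbed, noise_var):
    """Estimate original data from data distorted by additive i.i.d. noise.

    Assumption: eigenvalues of the sample covariance that exceed the
    Marchenko-Pastur upper edge for pure noise are attributed to the signal.
    """
    m, n = perturbed.shape  # m records, n attributes
    mean = perturbed.mean(axis=0)
    centered = perturbed - mean

    # Sample covariance of the observed (perturbed) data and its spectrum.
    cov = centered.T @ centered / m
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Largest eigenvalue expected from a pure-noise covariance matrix.
    lam_max = noise_var * (1.0 + np.sqrt(n / m)) ** 2

    # Keep eigenvectors whose eigenvalues lie above the noise band and
    # project the perturbed data onto the subspace they span.
    signal_vecs = eigvecs[:, eigvals > lam_max]
    return centered @ signal_vecs @ signal_vecs.T + mean


def demo(seed=0):
    """Compare mean squared error before and after filtering on toy data."""
    rng = np.random.default_rng(seed)
    m, n, k = 2000, 20, 2  # illustrative sizes; rank-k original data
    original = rng.normal(size=(m, k)) @ (3.0 * rng.normal(size=(k, n)))

    noise_var = 1.0
    noise = rng.normal(scale=np.sqrt(noise_var), size=(m, n))
    perturbed = original + noise

    estimate = spectral_filter(perturbed, noise_var)
    err_perturbed = np.mean((perturbed - original) ** 2)
    err_filtered = np.mean((estimate - original) ** 2)
    return err_perturbed, err_filtered


if __name__ == "__main__":
    before, after = demo()
    print(f"MSE of perturbed data: {before:.3f}")
    print(f"MSE after spectral filtering: {after:.3f}")
```

On strongly correlated toy data like this, the filtered estimate should sit much closer to the original than the perturbed data does, which is the sense in which additive noise can preserve little privacy. When the original attributes are weakly correlated, fewer eigenvalues clear the noise band and the filter recovers less.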


Author information

Corresponding author

Correspondence to Hillol Kargupta.

About this article

Cite this article

Kargupta, H., Datta, S., Wang, Q. et al. Random-data perturbation techniques and privacy-preserving data mining. Knowl Inf Syst 7, 387–414 (2005). https://doi.org/10.1007/s10115-004-0173-6

  • DOI: https://doi.org/10.1007/s10115-004-0173-6
