Abstract
Privacy is becoming an increasingly important issue in many data-mining applications, and this has triggered the development of many privacy-preserving data-mining techniques. A large fraction of them use randomized data distortion to mask sensitive data. This methodology attempts to hide the sensitive data by randomly modifying the data values, often using additive noise. This paper questions the utility of the random-value distortion technique in privacy preservation. It first notes that random matrices have predictable structures in the spectral domain, and then develops a random-matrix-based spectral-filtering technique to retrieve the original data from a dataset distorted by adding random values. The proposed method works by comparing the spectrum generated from the observed data with that of random matrices. The paper presents the theoretical foundation and extensive experimental results to demonstrate that, in many cases, random-data distortion preserves very little data privacy. The analytical framework presented here also points to several possible avenues for developing new privacy-preserving data-mining techniques. Examples include algorithms that explicitly guard against privacy breaches through linear transformations, and the use of multiplicative and colored noise for preserving privacy in data-mining applications.
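The spectral comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's exact algorithm. It assumes the noise is i.i.d. with a known variance, and it uses the Marchenko-Pastur upper edge as the bound on the eigenvalues that pure noise would produce. The function names `spectral_filter` and `demo`, the synthetic low-rank data, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def spectral_filter(perturbed, noise_var):
    """Estimate the original data from additively perturbed data.

    Rows are records, columns are attributes. Assumes i.i.d. additive
    noise with known variance `noise_var` (an assumption of this sketch).
    """
    m, n = perturbed.shape
    mean = perturbed.mean(axis=0)
    centered = perturbed - mean
    cov = centered.T @ centered / m
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Marchenko-Pastur upper edge: the largest covariance eigenvalue
    # expected from pure noise when m and n are large.
    lam_max = noise_var * (1.0 + np.sqrt(n / m)) ** 2
    # Eigenvectors whose eigenvalues exceed the noise edge are treated as signal.
    signal = eigvecs[:, eigvals > lam_max]
    # Project the perturbed data onto the signal subspace.
    return centered @ signal @ signal.T + mean


def demo(seed=0):
    """Compare mean squared error before and after filtering on synthetic data."""
    rng = np.random.default_rng(seed)
    m, n, k = 2000, 20, 3  # records, attributes, latent dimensions (hypothetical)
    # Correlated "original" data: a rank-k mixture of latent factors.
    original = 3.0 * rng.normal(size=(m, k)) @ rng.normal(size=(k, n))
    perturbed = original + rng.normal(scale=1.0, size=(m, n))
    estimate = spectral_filter(perturbed, noise_var=1.0)
    err_perturbed = np.mean((perturbed - original) ** 2)
    err_estimate = np.mean((estimate - original) ** 2)
    return err_perturbed, err_estimate


if __name__ == "__main__":
    before, after = demo()
    print(f"MSE of perturbed data: {before:.3f}")
    print(f"MSE after filtering:   {after:.3f}")
```

In this sketch the filter helps only because the synthetic data is strongly correlated across attributes, so its variance concentrates in a few eigen-directions that stand out above the noise edge. Data with little correlation structure would not be recovered this way.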
Cite this article
Kargupta, H., Datta, S., Wang, Q. et al. Random-data perturbation techniques and privacy-preserving data mining. Knowl Inf Syst 7, 387–414 (2005). https://doi.org/10.1007/s10115-004-0173-6