ABSTRACT
The ubiquity of the internet not only makes it convenient for individuals and organizations to share data for data mining or statistical analysis, but also greatly increases the chance of a privacy breach. Many techniques, such as random perturbation, exist to protect the privacy of such data sets. However, perturbation often degrades the quality of data mining or statistical analysis conducted over the perturbed data. This paper studies the impact of random perturbation on a popular data mining and analysis method: linear discriminant analysis. The contributions are twofold. First, we discover that for large data sets, the impact of perturbation is quite limited (i.e., high-quality results may be obtained directly from perturbed data) if the perturbation process satisfies certain conditions. Second, we discover that for small data sets, the negative impact of perturbation can be reduced by publishing additional statistics about the perturbation along with the perturbed data. We provide both theoretical derivations and experimental verifications of these results.
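The first result above can be illustrated with a toy experiment. The sketch below is a minimal illustration, not the paper's actual setup: the synthetic data, the noise level `sigma`, and the helper functions are all hypothetical choices. It applies additive zero-mean Gaussian perturbation to every attribute, then fits a two-class Fisher linear discriminant to both the original and the perturbed data and compares their accuracy on the original data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data set (a hypothetical stand-in for a real data set):
# n points per class, class means at (0, 0) and (2, 2), unit variance.
n = 2000
X = np.vstack([rng.normal(0.0, 1.0, size=(n, 2)),
               rng.normal(2.0, 1.0, size=(n, 2))])
y = np.repeat([0, 1], n)

def fit_lda(X, y):
    """Two-class Fisher discriminant: w = S_w^{-1}(mu1 - mu0), midpoint threshold."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)
    w = np.linalg.solve(Sw, mu1 - mu0)
    thresh = w @ (mu0 + mu1) / 2.0
    return w, thresh

def accuracy(w, thresh, X, y):
    pred = (X @ w > thresh).astype(int)
    return float((pred == y).mean())

# Additive zero-mean Gaussian perturbation of every attribute.
sigma = 1.0
Xp = X + rng.normal(0.0, sigma, size=X.shape)

w, t = fit_lda(X, y)       # discriminant learned from the original data
wp, tp = fit_lda(Xp, y)    # discriminant learned from the perturbed data

acc_clean = accuracy(w, t, X, y)
acc_pert = accuracy(wp, tp, X, y)
print(f"original: {acc_clean:.3f}  perturbed: {acc_pert:.3f}")
```

With a large sample, zero-mean noise leaves the class means (and hence the discriminant direction) nearly unchanged, so the two accuracies come out close; shrinking `n` or raising `sigma` widens the gap, consistent with the small-data-set concern the abstract raises.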