ABSTRACT
Data on individuals and entities are being collected widely. These data can contain information that explicitly identifies the individual (e.g., social security number). Data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. Data are often shared for business or legal reasons. This paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. We explore preserving anonymity by applying generalizations and suppressions to the potentially identifying portions of the data. We extend earlier work in this area along several dimensions. First, satisfying privacy constraints is considered in conjunction with the intended use of the data being disseminated. This allows us to optimize the process of preserving privacy for the specified usage. In particular, we investigate the privacy transformation in the context of data mining applications such as building classification and regression models. Second, our work improves on previous approaches by allowing more flexible generalizations of the data. Lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. These extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints.
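As a rough illustration of what generalization and suppression mean in practice, the following minimal Python sketch coarsens three potentially identifying attributes (zip code, birth year, gender) and then drops any record whose generalized combination is shared by fewer than k records. The k-style group-size constraint, the function names, the sample records, and the hand-picked generalization levels are all assumptions made for this example; the abstract does not specify them. The usage-aware optimization and the genetic algorithm search described above are not shown here.

```python
from collections import Counter


def generalize_zip(zip_code: str, keep: int) -> str:
    """Keep the first `keep` digits and mask the rest with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def generalize_year(year: int, width: int) -> str:
    """Replace a birth year with the `width`-year interval that contains it."""
    low = (year // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(records, k, zip_keep=3, year_width=10):
    """Generalize the identifying columns, then suppress rows in small groups.

    Each record is (zip_code, birth_year, gender, label). The first three
    columns are treated as potentially identifying (an assumption of this
    sketch); the label column is left untouched.
    """
    generalized = [
        (generalize_zip(z, zip_keep), generalize_year(y, year_width), g, label)
        for (z, y, g, label) in records
    ]
    # Count how many records share each generalized combination.
    group_sizes = Counter(row[:3] for row in generalized)
    # Suppress (drop) records whose group has fewer than k members.
    kept = [row for row in generalized if group_sizes[row[:3]] >= k]
    suppressed = len(generalized) - len(kept)
    return kept, suppressed


# Hypothetical sample data, invented for illustration only.
records = [
    ("10451", 1961, "F", "yes"),
    ("10453", 1965, "F", "no"),
    ("10458", 1968, "F", "yes"),
    ("10462", 1972, "M", "no"),
    ("10467", 1975, "M", "no"),
    ("94110", 1980, "M", "yes"),
]

kept, suppressed = anonymize(records, k=2)
for row in kept:
    print(row)
print("suppressed records:", suppressed)
```

Coarser generalization levels (fewer zip digits, wider year intervals) reduce the number of suppressed records but also remove detail that a downstream classification or regression model might need, which is the trade-off the paper sets out to optimize.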