ABSTRACT
Transaction data about individuals are increasingly collected to support a wide range of applications, from marketing to biomedical studies. Many organizations are required to publish these data, but doing so may result in privacy breaches if an attacker exploits potentially identifying information to link individuals to their records in the published data. Algorithms that prevent this threat by transforming transaction data prior to their release have been proposed recently, but they incur significant information loss because they cannot accommodate the range of different privacy requirements that data owners often have. To address this issue, we propose a novel clustering-based framework for anonymizing transaction data. Our framework provides the basis for designing algorithms that explore a larger solution space than existing methods, which allows data to be published with less information loss, and that can satisfy a wide range of privacy requirements. Based on this framework, we develop PCTA, a generalization-based algorithm that constructs anonymizations incurring a small amount of information loss under many different privacy requirements. Experiments with benchmark datasets verify that PCTA significantly outperforms the current state-of-the-art algorithms in terms of data utility, while being comparable in terms of efficiency.
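The linking threat and the effect of item generalization described in the abstract can be illustrated with a minimal sketch. This is not the PCTA algorithm; the item names, the `mapping`, and the helper functions `support` and `generalize` are hypothetical and chosen only for illustration. The sketch assumes that an attacker knows some items of an individual's transaction and that the risk is measured by how many published transactions contain those items.

```python
from typing import Dict, FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: Iterable[str]) -> int:
    """Count the transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in transactions if wanted <= t)


def generalize(items: Iterable[str], mapping: Dict[str, str]) -> Transaction:
    """Replace each item by its generalized label; unmapped items stay as-is."""
    return frozenset(mapping.get(item, item) for item in items)


# Hypothetical toy dataset: four transactions over items a, b, c, d.
original = [frozenset(t) for t in (["a", "d"], ["b", "d"], ["a", "c"], ["b", "c"])]

# Assumption: the attacker knows the target individual has items a and d.
attacker_knowledge = {"a", "d"}

# Hypothetical generalization: items a and b are merged into one label.
mapping = {"a": "(a,b)", "b": "(a,b)"}
published = [generalize(t, mapping) for t in original]

before = support(original, attacker_knowledge)
after = support(published, generalize(attacker_knowledge, mapping))
print(before, after)  # prints "1 2"
```

In the original data the attacker's itemset matches exactly one transaction, so the individual can be linked to a record. After generalization the same knowledge matches two transactions, at the cost of no longer distinguishing a from b, which is the information loss an anonymization algorithm tries to keep small.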
Index Terms
- PCTA: privacy-constrained clustering-based transaction data anonymization