Abstract
The sharing of sensitive personal data has become a core element of biomedical research. To protect privacy, a broad spectrum of safeguards must be implemented, including data anonymization. In this article, we present ARX, an anonymization tool for structured data which supports a wide range of methods for statistical disclosure control by providing (1) models for analyzing re-identification risks, (2) risk-based anonymization, (3) syntactic privacy criteria, such as k-anonymity, ℓ-diversity, t-closeness and δ-presence, (4) methods for automated and manual evaluation of data utility, and (5) an intuitive coding model using generalization, suppression and microaggregation. ARX is highly scalable and can anonymize datasets with several million records on commodity hardware. Moreover, it offers a comprehensive graphical user interface with wizards and visualizations that guide users through different aspects of the anonymization process. ARX is not just a toolbox but a fully fledged application, meaning that all implemented methods have been harmonized and integrated with each other. It is well understood that balancing privacy and data utility requires user feedback. To facilitate this interaction, ARX is highly configurable and provides various methods for exploring the solution space.
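As an illustration of two of the concepts named above, the following minimal Python sketch shows how generalization can turn a dataset that violates k-anonymity into one that satisfies it. It is not ARX code and does not use the ARX API; the helper names (generalize_age, generalize_zip, is_k_anonymous), the choice of quasi-identifiers and the toy records are all assumptions made for this example.

```python
from collections import Counter


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with an interval of the given width, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the first `keep` characters and mask the rest, e.g. '81675' -> '816**'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


# Hypothetical toy records (not taken from the chapter).
raw = [
    {"age": 34, "zip": "81675", "diagnosis": "flu"},
    {"age": 36, "zip": "81667", "diagnosis": "asthma"},
    {"age": 52, "zip": "80331", "diagnosis": "flu"},
    {"age": 58, "zip": "80339", "diagnosis": "diabetes"},
]

# Generalize the quasi-identifiers; the sensitive attribute is left unchanged.
generalized = [
    {
        "age": generalize_age(r["age"]),
        "zip": generalize_zip(r["zip"]),
        "diagnosis": r["diagnosis"],
    }
    for r in raw
]

qi = ["age", "zip"]
raw_ok = is_k_anonymous(raw, qi, k=2)                  # every raw record is unique
generalized_ok = is_k_anonymous(generalized, qi, k=2)  # two groups of two records
print(raw_ok, generalized_ok)
```

In this toy case each generalized group contains exactly two records, so 2-anonymity holds only after generalization. A real tool would additionally search over many candidate generalization levels and weigh them against data utility, which this sketch does not attempt.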
Notes
1. The authors “Fabian Prasser” and “Florian Kohlmayer” contributed equally to this work.
Acknowledgements
The authors would like to express their appreciation to Klaus A. Kuhn for his many helpful and insightful comments and suggestions.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Prasser, F., Kohlmayer, F. (2015). Putting Statistical Disclosure Control into Practice: The ARX Data Anonymization Tool. In: Gkoulalas-Divanis, A., Loukides, G. (eds) Medical Data Privacy Handbook. Springer, Cham. https://doi.org/10.1007/978-3-319-23633-9_6
DOI: https://doi.org/10.1007/978-3-319-23633-9_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23632-2
Online ISBN: 978-3-319-23633-9
eBook Packages: Computer Science, Computer Science (R0)