Putting Statistical Disclosure Control into Practice: The ARX Data Anonymization Tool

Prasser, Fabian; Kohlmayer, Florian

doi:10.1007/978-3-319-23633-9_6

Abstract

The sharing of sensitive personal data has become a core element of biomedical research. To protect privacy, a broad spectrum of techniques must be implemented, including data anonymization. In this article, we present ARX, an anonymization tool for structured data which supports a broad spectrum of methods for statistical disclosure control by providing (1) models for analyzing re-identification risks, (2) risk-based anonymization, (3) syntactic privacy criteria, such as k-anonymity, ℓ-diversity, t-closeness and δ-presence, (4) methods for automated and manual evaluation of data utility, and (5) an intuitive coding model using generalization, suppression and microaggregation. ARX is highly scalable and allows for anonymizing datasets with several millions of records on commodity hardware. Moreover, it offers a comprehensive graphical user interface with wizards and visualizations that guide users through different aspects of the anonymization process. ARX is not just a toolbox, but a fully-fledged application, meaning that all implemented methods have been harmonized and integrated with each other. It is well understood that balancing privacy and data utility requires user feedback. To facilitate this interaction, ARX is highly configurable and provides various methods for exploring the solution space.
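To make the coding model named in the abstract concrete, the following is a minimal sketch of generalization and suppression applied to reach k-anonymity, together with a simple worst-case re-identification risk. It is not ARX's API or implementation: the toy records, the function names (`generalize`, `anonymize`, `is_k_anonymous`, `highest_risk`) and the fixed generalization levels are all assumptions chosen for illustration, and the risk is taken here as one divided by the size of the smallest equivalence class.

```python
from collections import Counter

# Hypothetical toy records: (age, zip_code, diagnosis). Age and ZIP code are
# treated as quasi-identifiers; diagnosis is the sensitive attribute.
RECORDS = [
    (34, "81667", "flu"),
    (36, "81675", "asthma"),
    (38, "81669", "flu"),
    (52, "80331", "diabetes"),
    (57, "80333", "flu"),
    (71, "80469", "asthma"),
]


def generalize(record, age_width=10, zip_digits=2):
    """Coarsen the quasi-identifiers: age -> interval, ZIP -> masked prefix."""
    age, zip_code, _ = record
    low = (age // age_width) * age_width
    age_range = f"{low}-{low + age_width - 1}"
    masked_zip = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (age_range, masked_zip)


def anonymize(records, k):
    """Generalize all records, then suppress those in classes smaller than k."""
    keys = [generalize(r) for r in records]
    class_sizes = Counter(keys)
    kept = [(key, r[2]) for key, r in zip(keys, records) if class_sizes[key] >= k]
    suppressed = len(records) - len(kept)
    return kept, suppressed


def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(n >= k for n in Counter(key for key, _ in rows).values())


def highest_risk(rows):
    """Worst-case re-identification risk: 1 / size of the smallest class."""
    sizes = list(Counter(key for key, _ in rows).values())
    return 1 / min(sizes) if sizes else 0.0


if __name__ == "__main__":
    rows, suppressed = anonymize(RECORDS, k=2)
    print(f"suppressed records: {suppressed}")
    print(f"2-anonymous: {is_k_anonymous(rows, 2)}")
    print(f"highest risk: {highest_risk(rows)}")
```

In this sketch the generalization levels are fixed in advance. A tool that balances privacy against utility would instead search over many combinations of generalization levels and compare the resulting information loss, which is the solution space the abstract refers to.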

Notes

1. The authors Fabian Prasser and Florian Kohlmayer contributed equally to this work.

Acknowledgements

The authors would like to express their appreciation to Klaus A. Kuhn for his many helpful and insightful comments and suggestions.

Author information

Authors and Affiliations

Biomedical Informatics, Technische Universität München, München, Germany
Fabian Prasser (Chair) & Florian Kohlmayer (Chair)

Corresponding author

Correspondence to Fabian Prasser.

Editor information

Editors and Affiliations

IBM Research - Ireland, Mulhuddart, Dublin, Ireland
Aris Gkoulalas-Divanis
Cardiff University, Cardiff, United Kingdom
Grigorios Loukides

About this chapter

Cite this chapter

Prasser, F., Kohlmayer, F. (2015). Putting Statistical Disclosure Control into Practice: The ARX Data Anonymization Tool. In: Gkoulalas-Divanis, A., Loukides, G. (eds) Medical Data Privacy Handbook. Springer, Cham. https://doi.org/10.1007/978-3-319-23633-9_6

DOI: https://doi.org/10.1007/978-3-319-23633-9_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23632-2
Online ISBN: 978-3-319-23633-9
eBook Packages: Computer Science, Computer Science (R0)