Putting Statistical Disclosure Control into Practice: The ARX Data Anonymization Tool

Medical Data Privacy Handbook

Abstract

The sharing of sensitive personal data has become a core element of biomedical research. To protect privacy, a broad spectrum of techniques must be implemented, including data anonymization. In this article, we present ARX, an anonymization tool for structured data which supports a broad spectrum of methods for statistical disclosure control by providing (1) models for analyzing re-identification risks, (2) risk-based anonymization, (3) syntactic privacy criteria, such as k-anonymity, ℓ-diversity, t-closeness and δ-presence, (4) methods for automated and manual evaluation of data utility, and (5) an intuitive coding model using generalization, suppression and microaggregation. ARX is highly scalable and allows for anonymizing datasets with several millions of records on commodity hardware. Moreover, it offers a comprehensive graphical user interface with wizards and visualizations that guide users through different aspects of the anonymization process. ARX is not just a toolbox, but a fully-fledged application, meaning that all implemented methods have been harmonized and integrated with each other. It is well understood that balancing privacy and data utility requires user feedback. To facilitate this interaction, ARX is highly configurable and provides various methods for exploring the solution space.
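To make the coding model named in the abstract more concrete, the following minimal Python sketch shows how generalization and suppression can be combined to reach k-anonymity on a toy table. It does not use ARX or its API and is not the authors' implementation. The records, the fixed age bins, the ZIP-code truncation and all function names are hypothetical choices made for illustration only.

```python
from collections import Counter

# Hypothetical toy records: (age, zip_code, diagnosis).
# Age and ZIP code are treated as quasi-identifiers; diagnosis is sensitive.
RECORDS = [
    (34, "81667", "flu"),
    (36, "81669", "asthma"),
    (38, "81675", "flu"),
    (52, "80331", "diabetes"),
    (57, "80333", "flu"),
    (71, "90402", "asthma"),
]


def generalize(record, age_width=10, zip_digits=3):
    """Generalize quasi-identifiers: bin the age, truncate the ZIP code."""
    age, zip_code, diagnosis = record
    low = (age // age_width) * age_width
    age_range = f"{low}-{low + age_width - 1}"
    zip_prefix = zip_code[:zip_digits] + "*" * (len(zip_code) - zip_digits)
    return (age_range, zip_prefix, diagnosis)


def is_k_anonymous(records, k):
    """Check that every quasi-identifier combination occurs at least k times."""
    class_sizes = Counter((age, zp) for age, zp, _ in records)
    return all(size >= k for size in class_sizes.values())


def anonymize(records, k):
    """Generalize all records, then suppress those in classes smaller than k."""
    generalized = [generalize(r) for r in records]
    class_sizes = Counter((age, zp) for age, zp, _ in generalized)
    kept = [r for r in generalized if class_sizes[(r[0], r[1])] >= k]
    suppressed = len(generalized) - len(kept)
    return kept, suppressed


if __name__ == "__main__":
    kept, suppressed = anonymize(RECORDS, k=2)
    for row in kept:
        print(row)
    print("suppressed records:", suppressed)
    print("2-anonymous:", is_k_anonymous(kept, 2))
```

The number of suppressed records gives a crude indication of the utility cost of a transformation. Wider age bins or shorter ZIP prefixes would typically suppress fewer records but make the remaining values coarser, which is the kind of trade-off the abstract describes as requiring user feedback.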

Notes

  1. The authors “Fabian Prasser” and “Florian Kohlmayer” contributed equally to this work.

Acknowledgements

The authors would like to express their appreciation to Klaus A. Kuhn for his many helpful and insightful comments and suggestions.

Author information

Corresponding author

Correspondence to Fabian Prasser.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Prasser, F., Kohlmayer, F. (2015). Putting Statistical Disclosure Control into Practice: The ARX Data Anonymization Tool. In: Gkoulalas-Divanis, A., Loukides, G. (eds) Medical Data Privacy Handbook. Springer, Cham. https://doi.org/10.1007/978-3-319-23633-9_6

  • DOI: https://doi.org/10.1007/978-3-319-23633-9_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23632-2

  • Online ISBN: 978-3-319-23633-9

  • eBook Packages: Computer Science (R0)
