Exposing safe correlations in transactional datasets

  • Special Issue Paper
  • Published in Service Oriented Computing and Applications

Abstract

A particularly challenging problem for data anonymization is dealing with transactional data. Most anonymization methods assume homogeneous, independent and identically distributed (i.i.d.) data; “flattening” transactional data to satisfy this model results in wide, sparse data that does not anonymize well with traditional techniques. While there have been some approaches for generalization-based anonymization, bucketization techniques (e.g., anatomy) pose new challenges. In particular, bucketization provides the opportunity to learn correlations between data items, but also a risk of identifying individuals because of dependencies inferred from such correlations. We present a method that balances these issues, retaining the ability to discover correlations in the data, while hiding dependencies that would enable correlations to be used to link specific values to individuals. We introduce a correlation anonymization constraint that ensures correlations do not allow data to be linked to a specific individual, and an elastic safe grouping algorithm that meets this constraint while preserving data correlations. We evaluate the utility loss on a transactional rental dataset.
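The bucketization (anatomy) model the abstract refers to can be illustrated with a minimal sketch, assuming hypothetical records and field names (the paper's actual rental dataset and elastic safe grouping algorithm differ): quasi-identifiers and sensitive values are published in two tables linked only by a group id, so a sensitive value cannot be tied to one specific record within its group.

```python
import random

# Hypothetical records: (quasi-identifiers, sensitive value).
# Names and values are illustrative, not from the paper's dataset.
records = [
    ({"age": 34, "zip": "75011"}, "van"),
    ({"age": 36, "zip": "75012"}, "sedan"),
    ({"age": 51, "zip": "64000"}, "truck"),
    ({"age": 49, "zip": "64100"}, "sedan"),
]

def anatomize(records, group_size=2):
    """Anatomy-style bucketization: publish quasi-identifiers and
    sensitive values in two separate tables joined only by a group id,
    so a sensitive value cannot be linked to one record in its group."""
    qid_table, sens_table = [], []
    for start in range(0, len(records), group_size):
        group = records[start:start + group_size]
        gid = start // group_size
        for qid, _ in group:
            qid_table.append({**qid, "gid": gid})
        # Shuffle sensitive values within the group to break the
        # row-level linkage before publishing.
        values = [s for _, s in group]
        random.shuffle(values)
        for s in values:
            sens_table.append({"gid": gid, "sensitive": s})
    return qid_table, sens_table

qid_table, sens_table = anatomize(records)
```

Because each group publishes its full multiset of sensitive values, correlations between items remain learnable across buckets; the risk the paper addresses is that such learned correlations can become dependencies that re-link a value to an individual.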


Notes

  1. We assume User ID and Vin Number are independent and identically distributed (i.i.d.).

  2. Some data is anonymized/suppressed in order to meet the constraint; this is in keeping with privacy models that use partial suppression, replacing individuals’ values with a * to preserve privacy as in [34, 42], or with encryption as in the model in [36], where some data is left encrypted and only “safe” data is revealed.

  3. https://github.com/ElieChicha/ESC/blob/main/sourcedata.xlsx.
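The partial suppression mentioned in Note 2 can be sketched as follows; this is a minimal illustration with hypothetical field names, not the paper's algorithm.

```python
def suppress(record, unsafe_fields):
    """Partial suppression in the style of [34, 42]: replace an
    individual's values in unsafe fields with '*' so the remaining
    'safe' values can still be released."""
    return {k: ("*" if k in unsafe_fields else v) for k, v in record.items()}

# Hypothetical row; the original record is left unmodified.
row = {"user_id": "u17", "vin": "1HGCM82633A", "rental_days": 5}
print(suppress(row, {"user_id", "vin"}))
# {'user_id': '*', 'vin': '*', 'rental_days': 5}
```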

References

  1. Abadi M, Chu A, Goodfellow I, McMahan HB, Mironov I, Talwar K, Li Z (2016) Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp 308–318

  2. Aldous DJ (1985) Exchangeability and related topics. In: École d’été de probabilités de Saint-Flour, XIII—1983, volume 1117 of Lecture Notes in Math. Springer, Berlin, pp 1–198

  3. Andrés ME, Bordenabe NE, Chatzikokolakis K, Palamidessi C (2013) Geo-indistinguishability: differential privacy for location-based systems. In: Proceedings of the 2013 ACM SIGSAC conference on computer and communications security, pp 901–914

  4. Anjum A, Raschia G (2017) Banga: an efficient and flexible generalization-based algorithm for privacy preserving data publication. Computers 6(1):1

  5. Biskup J, Preuß M, Wiese L (2011) On the inference-proofness of database fragmentation satisfying confidentiality constraints. In: Proceedings of the 14th information security conference, Xi'an, China, Oct 26–29

  6. Bouna BA, Clifton C, Malluhi QM (2013) Using safety constraint for transactional dataset anonymization. In: DBSec, pp 164–178

  7. Bouna BA, Clifton C, Malluhi QM (2015) Efficient sanitization of unsafe data correlations. In: Proceedings of the workshops of the EDBT/ICDT 2015 joint conference (EDBT/ICDT), Brussels, Belgium, March 27th, 2015, pp 278–285

  8. Bouna B, Clifton C, Malluhi QM (2015) Anonymizing transactional datasets. J Comput Secur 23(1):89–106

  9. Centers for Medicare & Medicaid Services (1996) The Health Insurance Portability and Accountability Act of 1996 (HIPAA). http://www.cms.hhs.gov/hipaa/

  10. Chicha E, Bouna BA, Nassar M, Chbeir R (2018) Cloud-based differentially private image classification. Wirel Netw 99:1–8

  11. Chicha E, Bouna BA, Nassar M, Chbeir R, Haraty RA, Oussalah M, Benslimane D, Alraja MN (2021) A user-centric mechanism for sequentially releasing graph datasets under blowfish privacy. ACM Trans Internet Technol TOIT 21(1):1–25

  12. Ciriani V, De Capitani di Vimercati S, Foresti S, Jajodia S, Paraboschi S, Samarati P (2010) Combining fragmentation and encryption to protect privacy in data storage. ACM Trans Inf Syst Secur 13:22:1–22:33

  13. Dai C, Ghinita G, Bertino E, Byun J-W, Li N (2009) Tiamat: a tool for interactive analysis of microdata anonymization techniques. Proc VLDB Endow 2(2):1618–1621

  14. Domingo-Ferrer J, Soria-Comas J (2015) From t-closeness to differential privacy and vice versa in data anonymization. Knowl Based Syst 74:151–158

  15. Dwork C (2008) Differential privacy: a survey of results. In: International conference on theory and applications of models of computation. Springer, pp. 1–19

  16. Dwork C, Roth A et al (2014) The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 9(3–4):211–407

  17. Dwork C, McSherry F, Nissim K, Smith A (2006) Calibrating noise to sensitivity in private data analysis. In: Proceedings of the third conference on theory of cryptography, TCC’06. Springer, Berlin, Heidelberg, pp 265–284

  18. Fatemeh ANY, Azadeh S (2018) Bottom-up sequential anonymization in the presence of adversary knowledge. Inf Sci 405:316–335

  19. Friedman A, Schuster A (2010) Data mining with differential privacy. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, pp 493–502

  20. Fung BCM, Wang K, Chen R, Yu PS (2010) Privacy-preserving data publishing: a survey of recent developments. ACM Comput Surv (Csur) 42(4):1–53

  21. Gong Q, Luo J, Yang M, Ni W, Li X-B (2017) Anonymizing 1:m microdata with high utility. Knowl Based Syst 115(Supplement C):15–26

  22. Gong Q, Yang M, Chen Z, Wenjia W, Luo J (2017) A framework for utility enhanced incomplete microdata anonymization. Clust Comput 20(2):1749–1764

  23. He X, Machanavajjhala A, Ding B (2014) Blowfish privacy: tuning privacy-utility trade-offs using policies. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data. ACM, pp 1447–1458

  24. Hundepool A, Willenborg LCRJ (1996) \(\mu \)- and \(\tau \)-Argus: software for statistical disclosure control. In: Third international seminar on statistical confidentiality

  25. Kantarcioglu M, Inan A, Kuzu M (2010) Anonymization toolbox

  26. Kifer D (2009) Attacks on privacy and de Finetti's theorem. In: SIGMOD conference, pp 127–138

  27. LeFevre K, DeWitt DJ, Ramakrishnan R (2005) Incognito: efficient full-domain k-anonymity. In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp 49–60

  28. LeFevre K, DeWitt DJ, Ramakrishnan R (2006) Workload-aware anonymization. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 277–286

  29. Li T, Li N, Zhang J, Molloy I (2012) Slicing: a new approach for privacy preserving data publishing. IEEE Trans Knowl Data Eng 24(3):561–574

  30. Li B, Liu Y, Han X, Zhang J (2018) Cross-bucket generalization for information and privacy preservation. IEEE Trans Knowl Data Eng 30(3):449–459

  31. Li T, Li N (2008) Injector: mining background knowledge for data anonymization. In: ICDE, pp 446–455

  32. Li N, Li T, Venkatasubramanian S (2007) t-closeness: privacy beyond k-anonymity and l-diversity. In: ICDE, pp 106–115

  33. Li N, Qardaji W, Su D (2012) On sampling, anonymization, and differential privacy or, k-anonymization meets differential privacy. In: Proceedings of the 7th ACM symposium on information, computer and communications security, ASIACCS ’12. ACM, New York, pp 32–33

  34. Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M (2006) \(l\)-diversity: privacy beyond \(k\)-anonymity. In: Proceedings of the 22nd IEEE international conference on data engineering (ICDE 2006), Atlanta Georgia

  35. Nassar M, Chicha E, Bouna BA, Chbeir R (2020) Vip blowfish privacy in communication graphs. In: ICETE (2), pp 459–467

  36. Nergiz AE, Clifton C (2011) Query processing in private data outsourcing using anonymization. In: The 25th IFIP WG 11.3 conference on data and applications security and privacy (DBSEC-11), Richmond, Virginia

  37. Nergiz ME, Atzori M, Clifton C (2007) Hiding the presence of individuals from shared databases. In: Proceedings of the 2007 ACM SIGMOD international conference on management of data, pp 665–676

  38. Prasser F, Kohlmayer F, Lautenschläger R, Kuhn KA (2014) ARX: a comprehensive tool for anonymizing biomedical data. In: AMIA annual symposium proceedings, vol 2014. American Medical Informatics Association, p 984

  39. Ressel P (1985) De Finetti-type theorems: an analytical approach. Ann Probab 13(3):898–922

  40. Samarati P (2001) Protecting respondents’ identities in microdata release. IEEE Trans Knowl Data Eng 13(6):1010–1027

  41. Soria-Comas J, Domingo-Ferrer J (2013) Differential privacy via t-closeness in data publishing. In: Eleventh annual international conference on privacy, security and trust, PST 2013, 10–12 July, 2013, Tarragona, Catalonia, Spain, July 10–12, 2013, pp 27–35

  42. Sweeney L (2002) k-anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl Based Syst 10(5):557–570

  43. Wang H, Liu R (2015) Hiding outliers into crowd: privacy-preserving data publishing with outliers. Data Knowl Eng 100(Part A):94–115

  44. Wang K, Wang P, Fu AW-C, Wong RC-W (2016) Generalized bucketization scheme for flexible privacy settings. Inf Sci 348:377–393

  45. Wong RC-W, Fu AW-C, Wang K, Yu PS, Pei J (2011) Can the utility of anonymized data be used for privacy breaches? ACM Trans Knowl Discov Data 5(3):16:1–16:24

  46. Xiao X, Tao Y (2006) Anatomy: simple and effective privacy preservation. In: Proceedings of 32nd international conference on very large data bases (VLDB 2006), Seoul, Korea

Acknowledgements

The authors would like to acknowledge the National Council for Scientific Research of Lebanon (CNRS-L) and Univ. Pau & Pays Adour, UPPA-E2S, LIUPPA, for granting a doctoral fellowship to Elie Chicha.

Corresponding author

Correspondence to Elie Chicha.

About this article

Cite this article

Chicha, E., Al Bouna, B., Wünsche, K. et al. Exposing safe correlations in transactional datasets. SOCA 15, 289–307 (2021). https://doi.org/10.1007/s11761-021-00325-1
