DOI: 10.1145/3132847.3132951
research-article
Public Access

Building a Dossier on the Cheap: Integrating Distributed Personal Data Resources Under Cost Constraints

Published: 06 November 2017

ABSTRACT

A wide variety of personal data is routinely collected by numerous organizations that, in turn, share and sell their collections for analytic investigations (e.g., market research). To preserve privacy, certain identifiers are often redacted, perturbed, or even removed. A substantial number of attacks have shown that, if care is not taken, such data can be linked to external resources to determine the explicit identifiers (e.g., personal names) or infer sensitive attributes (e.g., income) of the individuals from whom the data was collected. As such, organizations increasingly rely upon record linkage methods to assess the risk such attacks pose and adopt countermeasures accordingly. Traditional linkage methods assume that only two datasets will be linked (e.g., linking de-identified hospital discharge records to identified voter registration lists), but with the advent of a multi-billion-dollar data broker industry, modern adversaries have access to a massive stash of datasets that can be leveraged. Still, realistic adversaries have budget constraints that prevent them from obtaining and integrating all relevant datasets. Thus, in this work, we investigate a novel privacy risk assessment framework based on adversaries who plan an integration of datasets to obtain the most accurate estimate of targeted sensitive attributes under a given budget. To solve this problem, we introduce a graph-based formulation and predictive modeling methods to prioritize data resources for linkage. We perform an empirical analysis using real-world voter registration data from two different U.S. states and show that the methods can efficiently and accurately estimate the risk of disclosing potentially sensitive information, even under a non-trivial amount of noise.
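
The budget-constrained planning described in the abstract can be illustrated with a small sketch. This is not the paper's graph-based formulation or its predictive models. It is a plain greedy heuristic that ranks candidate data resources by estimated gain per unit cost and buys them until the budget runs out. The `Resource` type, the `select_resources` function, and all costs and gain figures are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple


class Resource(NamedTuple):
    """A candidate dataset an adversary could acquire (all fields hypothetical)."""
    name: str
    cost: float      # assumed price of obtaining and integrating the dataset (> 0)
    est_gain: float  # assumed predicted improvement in attribute-estimation accuracy


def select_resources(resources: List[Resource], budget: float) -> Tuple[List[str], float]:
    """Greedy heuristic: take resources in order of gain per unit cost while they fit.

    Returns the chosen resource names and the total amount spent.
    This is a simple knapsack-style approximation, not an optimal solver.
    """
    chosen: List[str] = []
    remaining = budget
    ranked = sorted(resources, key=lambda r: r.est_gain / r.cost, reverse=True)
    for r in ranked:
        if r.cost <= remaining:
            chosen.append(r.name)
            remaining -= r.cost
    return chosen, budget - remaining


# Illustrative inputs only; none of these numbers come from the paper.
candidates = [
    Resource("A", cost=4, est_gain=0.20),
    Resource("B", cost=2, est_gain=0.15),
    Resource("C", cost=5, est_gain=0.30),
    Resource("D", cost=1, est_gain=0.01),
]
plan, spent = select_resources(candidates, budget=7)
print(plan, spent)
```

A ratio-based greedy rule is cheap to compute but can miss the best combination, which is one reason a risk assessor might prefer a richer model of how linked datasets interact.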

Published in

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN: 9781450349185
DOI: 10.1145/3132847

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

CIKM '17 Paper Acceptance Rate: 171 of 855 submissions, 20%. Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%.