ABSTRACT
A wide variety of personal data is routinely collected by numerous organizations that, in turn, share and sell their collections for analytic investigations (e.g., market research). To preserve privacy, certain identifiers are often redacted, perturbed, or even removed. A substantial number of attacks have shown that, if care is not taken, such data can be linked to external resources to determine explicit identifiers (e.g., personal names) or to infer sensitive attributes (e.g., income) of the individuals from whom the data was collected. As such, organizations increasingly rely upon record linkage methods to assess the risk such attacks pose and to adopt countermeasures accordingly. Traditional linkage methods assume that only two datasets are linked (e.g., de-identified hospital discharge records and identified voter registration lists), but with the advent of a multi-billion-dollar data broker industry, modern adversaries can draw on a massive stash of multiple datasets. Still, realistic adversaries have budget constraints that prevent them from obtaining and integrating all relevant datasets. Thus, in this work, we investigate a novel privacy risk assessment framework based on adversaries who plan an integration of datasets to obtain the most accurate estimate of targeted sensitive attributes under a fixed budget. To solve this problem, we introduce a graph-based formulation and predictive modeling methods that prioritize data resources for linkage. We perform an empirical analysis using real-world voter registration data from two U.S. states and show that the methods can efficiently and accurately estimate the risk of disclosing potentially sensitive information, even under a non-trivial amount of noise.
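The budget-constrained planning problem described above can be illustrated with a small sketch. This is not the authors' method: it is a plain greedy heuristic, chosen only to make the setting concrete. It assumes datasets are nodes of a graph, that an edge means two datasets share attributes that allow linkage, and that each dataset has a known purchase cost and an estimated accuracy gain. The function name `plan_acquisitions`, the dataset names, and all cost and gain values are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def plan_acquisitions(
    costs: Dict[str, float],
    gains: Dict[str, float],
    links: Dict[str, Set[str]],
    start: str,
    budget: float,
) -> Tuple[List[str], float]:
    """Greedily pick datasets to link, starting from `start`.

    Assumptions (hypothetical, for illustration only):
    - `links[d]` is the set of datasets that can be linked to `d`.
    - `costs[d]` is a positive purchase price; `gains[d]` is an
      estimated improvement in attribute-inference accuracy.
    - The adversary already holds `start` at no cost.
    Returns the acquisition order and the total amount spent.
    """
    acquired = [start]
    owned = {start}
    remaining = budget
    while True:
        # Only datasets linkable to something already held are candidates.
        frontier = {n for d in owned for n in links.get(d, set())} - owned
        affordable = [d for d in frontier if costs[d] <= remaining]
        if not affordable:
            break
        # Highest estimated gain per unit cost; name breaks ties.
        best = max(affordable, key=lambda d: (gains[d] / costs[d], d))
        acquired.append(best)
        owned.add(best)
        remaining -= costs[best]
    return acquired, budget - remaining


# Hypothetical example: a de-identified dataset plus three purchasable ones.
example_links = {
    "deid": {"voter", "property"},
    "voter": {"deid", "social"},
    "property": {"deid"},
    "social": {"voter"},
}
example_costs = {"voter": 4.0, "property": 5.0, "social": 3.0}
example_gains = {"voter": 0.4, "property": 0.25, "social": 0.45}

order, spent = plan_acquisitions(
    example_costs, example_gains, example_links, start="deid", budget=8.0
)
print(order, spent)
```

A greedy ratio rule like this gives no optimality guarantee; it only shows how a cost limit and linkability together restrict which datasets an adversary can usefully combine.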
Index Terms
- Building a Dossier on the Cheap: Integrating Distributed Personal Data Resources Under Cost Constraints