Building a Dossier on the Cheap: Integrating Distributed Personal Data Resources Under Cost Constraints

Authors:
Imrul Chowdhury Anindya, University of Texas at Dallas, Dallas, TX, USA
Harichandan Roy, University of Texas at Dallas, Dallas, TX, USA
Murat Kantarcioglu, University of Texas at Dallas, Dallas, TX, USA
Bradley Malin, Vanderbilt University, Nashville, TN, USA

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, November 2017, Pages 1549–1558
https://doi.org/10.1145/3132847.3132951

Published: 06 November 2017

ABSTRACT

A wide variety of personal data is routinely collected by numerous organizations that, in turn, share and sell their collections for analytic investigations (e.g., market research). To preserve privacy, certain identifiers are often redacted, perturbed or even removed. A substantial number of attacks have shown that, if care is not taken, such data can be linked to external resources to determine the explicit identifiers (e.g., personal names) or infer sensitive attributes (e.g., income) for the individuals from whom the data was collected. As such, organizations increasingly rely upon record linkage methods to assess the risk such attacks pose and adopt countermeasures accordingly. Traditional linkage methods assume only two datasets would be linked (e.g., linking de-identified hospital discharge records to identified voter registration lists), but with the advent of a multi-billion-dollar data broker industry, modern adversaries have access to a massive stash of datasets that can be leveraged. Still, realistic adversaries have budget constraints that prevent them from obtaining and integrating all relevant datasets. Thus, in this work, we investigate a novel privacy risk assessment framework, based on adversaries who plan an integration of datasets for the most accurate estimate of targeted sensitive attributes under a certain budget. To solve this problem, we introduce a graph-based formulation and predictive modeling methods to prioritize data resources for linkage. We perform an empirical analysis using real-world voter registration data from two different U.S. states and show that the methods can be used efficiently to accurately estimate potentially sensitive information disclosure risks even under a non-trivial amount of noise.
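
The budget-constrained adversary described in the abstract can be illustrated with a toy example. The abstract does not spell out the paper's graph-based formulation or its predictive models, so the sketch below is not the authors' method. It swaps in a plain greedy gain-per-cost heuristic. The `Source` class, the dataset names, the costs, and the expected gains are all hypothetical, and the gains are assumed to be independent and additive, which real linkage chains generally are not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A purchasable dataset. All field values used below are hypothetical."""
    name: str
    cost: float           # assumed price of acquiring and linking the dataset (> 0)
    expected_gain: float  # assumed accuracy gain for the targeted attribute


def select_sources(sources, budget):
    """Greedy heuristic: take sources in order of gain per unit cost
    while they still fit in the remaining budget."""
    chosen, remaining = [], budget
    ranked = sorted(sources, key=lambda s: s.expected_gain / s.cost, reverse=True)
    for src in ranked:
        if src.cost <= remaining:
            chosen.append(src)
            remaining -= src.cost
    return chosen, budget - remaining


# Hypothetical catalog of data resources an adversary might consider.
catalog = [
    Source("voter_registration", cost=10.0, expected_gain=0.30),
    Source("marketing_list", cost=40.0, expected_gain=0.40),
    Source("property_records", cost=25.0, expected_gain=0.35),
    Source("social_profile_dump", cost=60.0, expected_gain=0.45),
]

picked, spent = select_sources(catalog, budget=50.0)
picked_names = [s.name for s in picked]
print(picked_names, spent)
```

With a budget of 50.0, this heuristic picks the two sources with the best gain-per-cost ratio and leaves part of the budget unspent. A risk assessor could rerun it over a range of budgets to see roughly how cheaply an attack becomes plausible, keeping in mind that a greedy ratio rule is only an approximation for knapsack-style selection.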


Index Terms

1. Security and privacy
  1. Human and societal aspects of security and privacy


Published in
CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017, 2604 pages
ISBN: 9781450349185
DOI: 10.1145/3132847
General Chairs: Ee-Peng Lim (Singapore Management University, Singapore), Marianne Winslett (University of Illinois at Urbana-Champaign, USA, and Advanced Digital Sciences Center, Singapore)
Program Chairs: Mark Sanderson (RMIT, Australia), Ada Fu (Chinese University of Hong Kong, Hong Kong), Jimeng Sun (Georgia Tech, USA), Shane Culpepper (RMIT, Australia), Eric Lo (Chinese University of Hong Kong, Hong Kong), Joyce Ho (Emory University, USA), Debora Donato (Mix Tech, Inc., USA), Rakesh Agrawal (Data Insights Laboratories, USA), Yu Zheng (Microsoft Research Asia, China), Carlos Castillo (Qatar Computing Research Institute, Qatar), Aixin Sun (Nanyang Technological University, Singapore), Vincent S. Tseng (National Cheng Kung University, Taiwan), Chenliang Li (Wuhan University, China)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data brokers
data integration
data privacy
heuristic approach
identity disclosure
probabilistic model
record linkage
Qualifiers
- research-article
Conference

Acceptance Rates
CIKM '17 Paper Acceptance Rate: 171 of 855 submissions, 20%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%