research-article

SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining

Authors:

Benjamin Weggenmann,

Florian KerschbaumAuthors Info & Claims

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

Pages 305 - 314

https://doi.org/10.1145/3209978.3210008

Published: 27 June 2018 Publication History

Get Access

Abstract

Text mining and information retrieval techniques have been developed to assist us with analyzing, organizing and retrieving documents with the help of computers. In many cases, it is desirable that the authors of such documents remain anonymous: Search logs can reveal sensitive details about a user, critical articles or messages about a company or government might have severe or fatal consequences for a critic, and negative feedback in customer surveys might negatively impact business relations if they are identified. Simply removing personally identifying information from a document is, however, insufficient to protect the writer's identity: Given some reference texts of suspect authors, so-called authorship attribution methods can reidentfy the author from the text itself. One of the most prominent models to represent documents in many common text mining and information retrieval tasks is the vector space model where each document is represented as a vector, typically containing its term frequencies or related quantities. We therefore propose an automated text anonymization approach that produces synthetic term frequency vectors for the input documents that can be used in lieu of the original vectors. We evaluate our method on an exemplary text classification task and demonstrate that it only has a low impact on its accuracy. In contrast, we show that our method strongly affects authorship attribution techniques to the level that they become infeasible with a much stronger decline in accuracy. Other than previous authorship obfuscation methods, our approach is the first that fulfills differential privacy and hence comes with a provable plausible deniability guarantee.

References

[1]

A. Abbasi and H. Chen . 2008. Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS) Vol. 26, 2 (2008), 7.

Abstract

References

Cited By

Index Terms

Recommendations

DP-VAE: Human-Readable Text Anonymization for Online Reviews with Differentially Private Variational Autoencoders

Differentially private data publishing via optimal univariate microaggregation and record perturbation

ϵ-Differential Privacy for Microdata Releases Does Not Guarantee Confidentiality (Let Alone Utility)

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

Share

Share this Publication link

Share on social media

Affiliations

$ϵ$ -Differential Privacy for Microdata Releases Does Not Guarantee Confidentiality (Let Alone Utility)