SynTF: Synthetic and Differentially Private Term Frequency Vectors for Privacy-Preserving Text Mining

Published: 27 June 2018

Abstract

Text mining and information retrieval techniques have been developed to assist us with analyzing, organizing, and retrieving documents with the help of computers. In many cases, it is desirable that the authors of such documents remain anonymous: search logs can reveal sensitive details about a user, critical articles or messages about a company or government might have severe or fatal consequences for the critic, and negative feedback in customer surveys might harm business relations if the respondents are identified. Simply removing personally identifying information from a document is, however, insufficient to protect the writer's identity: given some reference texts of suspect authors, so-called authorship attribution methods can re-identify the author from the text itself. One of the most prominent models for representing documents in common text mining and information retrieval tasks is the vector space model, where each document is represented as a vector that typically contains its term frequencies or related quantities. We therefore propose an automated text anonymization approach that produces synthetic term frequency vectors for the input documents, which can be used in lieu of the original vectors. We evaluate our method on an exemplary text classification task and demonstrate that it has only a low impact on classification accuracy. In contrast, we show that our method strongly affects authorship attribution techniques, whose accuracy declines much more sharply, to the point that they become infeasible. Unlike previous authorship obfuscation methods, our approach is the first that fulfills differential privacy and hence comes with a provable plausible deniability guarantee.
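To make the two ideas in the abstract concrete, the following minimal Python sketch first builds a term frequency vector in the vector space model and then draws a synthetic vector by replacing words through a generic exponential mechanism. It is an illustration under stated assumptions, not the algorithm from the paper: the function names, the character-overlap utility, and the per-word use of epsilon are all hypothetical choices made for this example, and drawing several words composes the privacy cost, which the sketch does not account for.

```python
import math
import random
from collections import Counter


def term_frequencies(text, vocabulary):
    """Return the term frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def char_overlap(a, b):
    """Hypothetical utility in [0, 1]: Jaccard overlap of the character sets."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


def exponential_mechanism(word, vocabulary, utility, epsilon, rng):
    """Sample a substitute for `word` with probability proportional to
    exp(epsilon * utility / 2). Assumes utility values lie in [0, 1],
    so the sensitivity of the utility is at most 1."""
    weights = [math.exp(epsilon * utility(word, cand) / 2.0) for cand in vocabulary]
    return rng.choices(vocabulary, weights=weights, k=1)[0]


def synthetic_tf(text, vocabulary, epsilon, length, seed=None):
    """Draw `length` substitute words and return their term frequency vector."""
    rng = random.Random(seed)
    known = set(vocabulary)
    words = [w for w in text.lower().split() if w in known]
    if not words:
        raise ValueError("text contains no vocabulary terms")
    sampled = Counter(
        exponential_mechanism(rng.choice(words), vocabulary, char_overlap, epsilon, rng)
        for _ in range(length)
    )
    return [sampled[term] for term in vocabulary]


vocab = ["the", "cat", "dog", "mat"]
original = term_frequencies("The cat sat on the mat", vocab)  # [2, 1, 0, 1]
synthetic = synthetic_tf("The cat sat on the mat", vocab, epsilon=1.0, length=6, seed=0)
```

The synthetic vector has the same shape as the original, so a downstream classifier could consume it unchanged. Smaller values of epsilon flatten the sampling weights toward a uniform draw over the vocabulary, while larger values favor words that score highly under the utility.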

Published In

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
June 2018
1509 pages
ISBN: 9781450356572
DOI: 10.1145/3209978
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. anonymization
  2. authorship attribution
  3. authorship obfuscation
  4. differential privacy
  5. synthetic data
  6. text classification
  7. text mining

Qualifiers

  • Research-article

Funding Sources

  • European Union

Acceptance Rates

SIGIR '18 Paper Acceptance Rate 86 of 409 submissions, 21%;
Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 39
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 05 Mar 2025

Cited By

  • (2024) Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic Similarity and Privacy Preservation of Differentially Private Rewritten Text. Proceedings of the 19th International Conference on Availability, Reliability and Security, 1-11. DOI: 10.1145/3664476.3669926. Online publication date: 30-Jul-2024
  • (2024) 1-Diffractor: Efficient and Utility-Preserving Text Obfuscation Leveraging Word-Level Metric Differential Privacy. Proceedings of the 10th ACM International Workshop on Security and Privacy Analytics, 23-33. DOI: 10.1145/3643651.3659896. Online publication date: 21-Jun-2024
  • (2024) Silencing the Risk, Not the Whistle: A Semi-automated Text Sanitization Tool for Mitigating the Risk of Whistleblower Re-Identification. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 733-745. DOI: 10.1145/3630106.3658936. Online publication date: 3-Jun-2024
  • (2024) Textual Differential Privacy for Context-Aware Reasoning with Large Language Model. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), 988-997. DOI: 10.1109/COMPSAC61105.2024.00135. Online publication date: 2-Jul-2024
  • (2024) Privacy-preserving data integration and sharing in multi-party IoT environments: An entity embedding perspective. Information Fusion 108 (102380). DOI: 10.1016/j.inffus.2024.102380. Online publication date: Aug-2024
  • (2024) PriMonitor: An adaptive tuning privacy-preserving approach for multimodal emotion detection. World Wide Web 27:2. DOI: 10.1007/s11280-024-01246-7. Online publication date: 2-Feb-2024
  • (2024) Stock price nowcasting and forecasting with deep learning. Journal of Intelligent Information Systems. DOI: 10.1007/s10844-024-00908-2. Online publication date: 13-Nov-2024
  • (2022) On the Privacy–Utility Trade-Off in Differentially Private Hierarchical Text Classification. Applied Sciences 12:21 (11177). DOI: 10.3390/app122111177. Online publication date: 4-Nov-2022
  • (2022) A Model-Agnostic Approach to Differentially Private Topic Mining. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1835-1845. DOI: 10.1145/3534678.3539417. Online publication date: 14-Aug-2022
  • (2022) DP-VAE: Human-Readable Text Anonymization for Online Reviews with Differentially Private Variational Autoencoders. Proceedings of the ACM Web Conference 2022, 721-731. DOI: 10.1145/3485447.3512232. Online publication date: 25-Apr-2022
