Unsupervised Topic Extraction from Privacy Policies

Authors:
David Sarne, Jonathan Schler, Alon Singer, Ayelet Sela, Ittai Bar Siman Tov (all Bar-Ilan Univ.)

WWW '19: Companion Proceedings of The 2019 World Wide Web Conference, May 2019, Pages 563–568, https://doi.org/10.1145/3308560.3317585

Published: 13 May 2019

ABSTRACT

This paper proposes automatic topic modeling of large-scale privacy policy corpora using unsupervised learning techniques. Unsupervised learning offers several advantages for this task, chiefly the ability to analyze any new corpus with a fraction of the effort required by supervised learning, to study how topics of interest change over time, and to identify finer-grained topics of interest in these privacy policies. Based on general principles of document analysis, we synthesize a cohesive framework for privacy policy topic modeling and apply it to a corpus of 4,982 privacy policies of mobile applications crawled from the Google Play Store. The results demonstrate that even this moderate-size corpus yields fairly comprehensive insights into the focus and scope of current privacy policy documents. The extracted topics, their structure, and the applicability of the unsupervised approach are validated through an extensive comparison with similar findings reported in prior work that uses supervised learning (which depends heavily on manual annotation by experts). The comparison suggests a substantial overlap between the topics found and those reported in prior work, and also reveals some new topics of interest.
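
To make the idea of unsupervised topic extraction concrete, the following is a minimal sketch of one widely used topic model, Latent Dirichlet Allocation fitted by collapsed Gibbs sampling. The abstract does not state which model or settings the authors use, so this should not be read as their framework. The function name `fit_lda`, the hyperparameters `alpha` and `beta`, the stop-word list, and the toy policy sentences are all assumptions made for illustration only.

```python
import random
from collections import Counter

# Hypothetical stop-word list and toy "policies"; not taken from the paper's corpus.
STOP_WORDS = {"we", "your", "to", "the", "and", "with", "from", "may", "you", "our", "for"}
RAW_POLICIES = [
    "we collect your location and device data",
    "location data from your device may be collected",
    "we share data with advertising partners",
    "advertising partners may receive data we share",
    "you may delete your account and data",
    "to delete your account contact our support",
]
toy_policies = [[w for w in text.split() if w not in STOP_WORDS] for text in RAW_POLICIES]


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents with collapsed Gibbs sampling.

    Returns (top_words, doc_mix): the most frequent words per topic and the
    smoothed topic proportions per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # token counts per topic

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    top_words = [
        [w for w, c in topic_word[k].most_common(3) if c > 0]
        for k in range(num_topics)
    ]
    doc_mix = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return top_words, doc_mix


top_words, doc_mix = fit_lda(toy_policies, num_topics=3)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
for d, mix in enumerate(doc_mix):
    print(f"policy {d}: {[round(p, 2) for p in mix]}")
```

No labels are supplied at any point: the topics emerge from word co-occurrence alone, which is why such an approach can be rerun on a new corpus without a fresh round of expert annotation. On a corpus this small the topics are unstable, so the printed output is illustrative at best.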

Published in
WWW '19: Companion Proceedings of The 2019 World Wide Web Conference
May 2019
1331 pages
ISBN: 9781450366755
DOI: 10.1145/3308560
Editors: Ling Liu (Georgia Tech, USA), Ryen White (Microsoft Research, USA)
Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Topic modeling
privacy policies
unsupervised learning
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
Article Metrics
- Total Citations: 12
- Total Downloads: 510
- Downloads (Last 12 months): 59
- Downloads (Last 6 weeks): 4