ABSTRACT
This paper proposes automatic topic modeling of large-scale corpora of privacy policies using unsupervised learning techniques. Unsupervised learning offers several advantages for this task: a new corpus can be analyzed with a fraction of the effort required by supervised learning, changes in topics of interest can be studied over time, and finer-grained topics of interest can be identified in the policies. Based on general principles of document analysis, we synthesize a cohesive framework for privacy policy topic modeling and apply it to a corpus of 4,982 privacy policies of mobile applications crawled from the Google Play Store. The results demonstrate that even this moderate-size corpus yields quite comprehensive insights into the focus and scope of current privacy policy documents. The extracted topics, their structure, and the applicability of the unsupervised approach are validated through an extensive comparison with findings reported in prior work that uses supervised learning (which depends heavily on manual annotation by experts). The comparison suggests a substantial overlap between the topics found and those reported in prior work, and also reveals some new topics of interest.
Index Terms
- Unsupervised Topic Extraction from Privacy Policies