DOI: 10.1145/3308560.3317585
research-article

Unsupervised Topic Extraction from Privacy Policies

Published: 13 May 2019

ABSTRACT

This paper proposes automatic topic modeling of large-scale corpora of privacy policies using unsupervised learning techniques. Unsupervised learning offers numerous advantages for this task, chief among them the ability to analyze any new corpus with a fraction of the effort required by supervised learning, the ability to study how topics of interest change over time, and the ability to identify finer-grained topics of interest in these privacy policies. Based on general principles of document analysis, we synthesize a cohesive framework for privacy policy topic modeling and apply it to a corpus of 4,982 privacy policies of mobile applications crawled from the Google Play Store. The results demonstrate that even with this moderate-size corpus, quite comprehensive insights can be attained regarding the focus and scope of current privacy policy documents. The extracted topics, their structure, and the applicability of the unsupervised approach are validated through an extensive comparison with findings reported in prior work that uses supervised learning (which depends heavily on manual annotation by experts). The comparison suggests a substantial overlap between the topics found and those reported in prior work, and also reveals some new topics of interest.


Published in

WWW '19: Companion Proceedings of The 2019 World Wide Web Conference
May 2019
1331 pages
ISBN: 9781450366755
DOI: 10.1145/3308560

Copyright © 2019 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
