Privacy Lost and Found: An Investigation at Scale of Web Privacy Policy Availability

Research article
Published: 22 August 2023
DOI: 10.1145/3573128.3604902

Abstract

Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, even though laws such as GDPR and CCPA reinforce this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies of 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability, such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web, and find that privacy policy URLs are available on only 34% of websites. Of these URLs, 1.37% are broken links, and 1.23% of the valid links lead to pages without a policy. To enable investigation of privacy policies at scale, we also use the capture-recapture technique to estimate the total number of English language privacy policies on the web and the distribution of these documents across top level domains and sectors of commerce. We estimate the lower bound on the number of English language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus, which contains around 600k policies and their metadata: policy URL, length, readability, sector of commerce, and policy crawl date.
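The abstract does not say which capture-recapture estimator is used, so the following is only a minimal sketch of the general idea, assuming the classic two-sample Lincoln-Petersen estimator and its Chapman correction. The function names, the example.org URLs, and the sample sizes are all hypothetical and are not taken from the paper.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Two-sample capture-recapture estimate of a population size.

    n1: number of items found by the first sample (e.g. one crawl)
    n2: number of items found by the second, independent sample
    m:  number of items found by both samples
    """
    if m <= 0:
        raise ValueError("the samples must overlap for the estimate to be defined")
    if m > min(n1, n2):
        raise ValueError("the overlap cannot exceed either sample size")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's variant, which is less biased when the overlap is small."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical data: two crawls that each find part of the same population
# of policy URLs (made-up URLs, for illustration only).
crawl_a = {f"https://example.org/policy/{i}" for i in range(0, 600)}
crawl_b = {f"https://example.org/policy/{i}" for i in range(400, 900)}

overlap = len(crawl_a & crawl_b)  # URLs seen by both crawls
estimate = lincoln_petersen(len(crawl_a), len(crawl_b), overlap)
corrected = chapman(len(crawl_a), len(crawl_b), overlap)

print(f"overlap={overlap}, Lincoln-Petersen={estimate:.0f}, Chapman={corrected:.1f}")
```

The intuition is that a small overlap between two independent samples suggests a large unseen population. The estimate relies on the samples being independent, an assumption that may not hold for real web crawls.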

Cited By

  • (2024) Privacy at Risk: An Investigation of Data Collection Practices and Tracking Scripts on Government Websites in Java. 2024 Seventh International Conference on Vocational Education and Electrical Engineering (ICVEE), 188-193. DOI: 10.1109/ICVEE63912.2024.10823899. Online publication date: 30-Oct-2024

      Published In

      DocEng '23: Proceedings of the ACM Symposium on Document Engineering 2023
      August 2023
      187 pages
      ISBN: 9798400700279
      DOI: 10.1145/3573128

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Badges

      • Best Student Paper

      Author Tags

      1. capture-recapture
      2. policy availability
      3. privacy
      4. privacy policy

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      DocEng '23
      DocEng '23: ACM Symposium on Document Engineering 2023
      August 22 - 25, 2023
      Limerick, Ireland

      Acceptance Rates

      DocEng '23 Paper Acceptance Rate 9 of 27 submissions, 33%;
      Overall Acceptance Rate 194 of 564 submissions, 34%
