DOI: 10.1145/3573128.3609342

Privacy Now or Never: Large-Scale Extraction and Analysis of Dates in Privacy Policy Text

Published: 22 August 2023

Abstract

The General Data Protection Regulation (GDPR) and other recent privacy laws require organizations to post their privacy policies and place specific expectations on organizations' privacy practices. Privacy policies are documents written in natural language, and one expectation placed upon them is that they remain up to date. To investigate legal compliance with this recency requirement at a large scale, we create a novel pipeline that combines crawling, regex-based extraction, candidate date classification, and date object creation to extract updated and effective dates from privacy policies written in English. We then analyze patterns in policy dates using four web crawls and find that only about 40% of privacy policies online contain a date, which makes it difficult to assess their regulatory compliance. We also find that updates to privacy policies are temporally concentrated around the passage of laws regulating digital privacy (such as the GDPR), and that more popular domains are more likely both to have policy dates and to update their policies regularly.
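
The extraction stages named in the abstract (regex-based extraction, candidate date classification, date object creation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not give the paper's actual patterns, cue words, or classifier, so the regular expression, the `CUES` table, the 60-character context window, and the function names below are hypothetical, and a simple nearest-keyword rule stands in for whatever classification method the authors used.

```python
import re
from datetime import date
from typing import List, Optional, Tuple

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}
MONTH_ALT = "|".join(MONTH_NAMES)

# Hypothetical candidate pattern: "May 25, 2018" or "25 May 2018".
DATE_RE = re.compile(
    rf"\b(?:(?P<m1>{MONTH_ALT})\s+(?P<d1>\d{{1,2}}),?\s+(?P<y1>\d{{4}})"
    rf"|(?P<d2>\d{{1,2}})\s+(?P<m2>{MONTH_ALT})\s+(?P<y2>\d{{4}}))\b",
    re.IGNORECASE,
)

# Hypothetical cue words mapped to the two date types named in the abstract.
CUES = {"updated": "updated", "revised": "updated",
        "modified": "updated", "effective": "effective"}


def classify(context: str) -> Optional[str]:
    """Label a candidate by the cue word closest to it in the preceding text."""
    context = context.lower()
    best_label, best_pos = None, -1
    for cue, label in CUES.items():
        pos = context.rfind(cue)
        if pos > best_pos:
            best_label, best_pos = label, pos
    return best_label


def extract_policy_dates(text: str, window: int = 60) -> List[Tuple[str, date]]:
    """Return (label, date) pairs for candidates that have a nearby cue word."""
    results = []
    for match in DATE_RE.finditer(text):
        month = match.group("m1") or match.group("m2")
        day = match.group("d1") or match.group("d2")
        year = match.group("y1") or match.group("y2")
        label = classify(text[max(0, match.start() - window):match.start()])
        if label is None:
            continue  # a date with no cue nearby is not treated as a policy date
        try:
            parsed = date(int(year), MONTHS[month.lower()], int(day))
        except ValueError:
            continue  # impossible calendar date such as "February 30"
        results.append((label, parsed))
    return results


sample = "Last updated: May 25, 2018. This policy is effective as of 1 June 2018."
for label, parsed in extract_policy_dates(sample):
    print(label, parsed.isoformat())
```

Building a `datetime.date` at the end means impossible dates are rejected and the surviving dates can be compared or grouped by year, which is the kind of downstream analysis the abstract describes. The fixed context window is a crude heuristic: a cue word belonging to a neighbouring sentence can mislabel an unrelated date.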

    Published In

    DocEng '23: Proceedings of the ACM Symposium on Document Engineering 2023
    August 2023
    187 pages
    ISBN:9798400700279
    DOI:10.1145/3573128

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. crawling
    2. date extraction
    3. privacy policy

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    DocEng '23
    DocEng '23: ACM Symposium on Document Engineering 2023
    August 22 - 25, 2023
    Limerick, Ireland

    Acceptance Rates

    DocEng '23 Paper Acceptance Rate 9 of 27 submissions, 33%;
    Overall Acceptance Rate 194 of 564 submissions, 34%
