DOI:10.1145/3545948.3545952
research-article

Viopolicy-Detector: An Automated Approach to Detecting GDPR Suspected Compliance Violations in Websites

Published: 26 October 2022

Abstract

To provide users with personalized services, websites collect and track users’ activity data. At the same time, each website publishes a privacy policy to establish the legality of these actions. The General Data Protection Regulation (GDPR) was implemented to protect the privacy of user data. Because the GDPR is a programmatic regulation, it gives no specific guidance on what a privacy policy should contain. Websites may therefore still commit violations that put users’ private data at risk of leaking. In this paper, we define a violating behavior as a website collecting data that its privacy policy does not declare. To detect this behavior, we first interpret the GDPR and analyze 1000 website privacy policies to derive a personal data classification with eight categories. Based on this classification, we propose a privacy policy annotation scheme covering the eight categories and collect 145 related Web APIs. We then propose an automated method to detect suspected GDPR compliance violations in websites. On the one hand, we use a multi-label text classification model to extract the data collection stated in the privacy policy, with a precision of 0.9817. On the other hand, we dynamically monitor the website’s JavaScript calls related to personal data collection during user visits. Finally, we compare the two results to determine whether a violating behavior occurred. We apply this method to the top 500 European websites (451 websites in practice). A total of 159 websites (35.3%) appear to violate the GDPR. We analyze the detection results from several perspectives, including the types of data declared in privacy policies, the data actually collected by websites, and which kinds of data collection are most likely to cause violations. We then classify the violating websites by category and find that websites in the Social category present the most violations. Finally, we examine the rankings of the violating websites. Surprisingly, top-ranking sites are even more prone to violations. Even some globally well-known websites show violations, such as BBC, Nokia, eBay, and Google.
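The comparison step described in the abstract can be pictured as a set difference between the data categories observed at run time and the categories the privacy policy declares. The following is only a minimal sketch of that idea, not the authors' implementation: the category labels, the `API_TO_CATEGORY` mapping, and the function names are hypothetical, and the three Web APIs shown stand in for the 145 APIs the paper collects.

```python
from typing import Dict, Iterable, Set

# Hypothetical mapping from monitored Web APIs to personal-data categories.
# The paper uses eight categories and 145 APIs; these three are illustrative.
API_TO_CATEGORY: Dict[str, str] = {
    "navigator.geolocation.getCurrentPosition": "location",
    "document.cookie": "cookie",
    "navigator.userAgent": "device",
}


def observed_categories(api_calls: Iterable[str]) -> Set[str]:
    """Map the JavaScript calls seen during a visit to data categories."""
    return {API_TO_CATEGORY[call] for call in api_calls if call in API_TO_CATEGORY}


def suspected_violations(declared: Set[str], api_calls: Iterable[str]) -> Set[str]:
    """Categories collected at run time but not declared in the policy."""
    return observed_categories(api_calls) - declared


def violation_rate(n_violating: int, n_total: int) -> float:
    """Fraction of analyzed websites with at least one suspected violation."""
    return n_violating / n_total


if __name__ == "__main__":
    # Assumed classifier output for one policy: only cookies are declared.
    declared = {"cookie"}
    # Assumed monitoring log for the same site; unmapped calls are ignored.
    calls = ["document.cookie", "navigator.geolocation.getCurrentPosition", "fetch"]
    print(suspected_violations(declared, calls))
    # The abstract's headline figure: 159 of 451 analyzed websites.
    print(f"{100 * violation_rate(159, 451):.1f}%")
```

Under these assumptions, a site is flagged as soon as the difference is non-empty, which matches the abstract's definition of a violating behavior as collection without a corresponding declaration.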

Cited By

  • (2024) SoK: Technical Implementation and Human Impact of Internet Privacy Regulations. 2024 IEEE Symposium on Security and Privacy (SP), 673-696. DOI:10.1109/SP54263.2024.00206. Online publication date: 19-May-2024
  • (2023) The AILA Methodology for Automated and Intelligent Likelihood Assignment in Risk Assessment. IEEE Access 11, 26170-26183. DOI:10.1109/ACCESS.2023.3245333. Online publication date: 2023

Published In

RAID '22: Proceedings of the 25th International Symposium on Research in Attacks, Intrusions and Defenses
October 2022
536 pages
ISBN:9781450397049
DOI:10.1145/3545948
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GDPR violation
  2. Web API
  3. inconsistent data collection
  4. privacy policy

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

RAID 2022

Acceptance Rates

Overall Acceptance Rate 43 of 173 submissions, 25%
