Guidelines for GDPR compliance in Big Data systems☆
Introduction
The General Data Protection Regulation (GDPR) [1] sets new requirements on security and data protection through 99 articles and 173 recitals and aims to protect the rights and freedoms of natural persons. Every organization that deals with personal data has to comply with GDPR to protect these rights and to be accountable while improving its business models [2]. Accountability means demonstrating how controllers comply with data protection principles. Each organization must answer the following questions: What information is processed, and why? How and where is the data stored? Who can access it, and why? Is it up to date and accurate? How long will it be kept? How will it be safeguarded, and how will accountability be achieved?
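As a minimal sketch, the accountability questions above can be captured as one structured record per dataset. The class name `ProcessingRecord`, its field names and the `unanswered` helper are hypothetical and chosen only for illustration; they are not taken from the regulation or from the framework described in this paper.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class ProcessingRecord:
    """Hypothetical record answering the accountability questions for one dataset."""
    data_categories: List[str]   # what information is processed?
    purpose: str                 # why is it processed?
    storage_location: str        # how and where is it stored?
    authorized_roles: List[str]  # who can access it?
    last_verified: date          # is it up to date and accurate?
    retention: timedelta         # how long will it be kept?
    safeguards: List[str] = field(default_factory=list)  # how is it safeguarded?

    def unanswered(self) -> List[str]:
        """Return the names of fields left empty, i.e. questions still open."""
        return [name for name, value in vars(self).items()
                if value in ("", [], None)]


record = ProcessingRecord(
    data_categories=["heart rate"],
    purpose="remote patient monitoring",
    storage_location="example-cluster/patients",  # placeholder location
    authorized_roles=["physician"],
    last_verified=date(2021, 1, 1),
    retention=timedelta(days=365),
)
open_questions = record.unanswered()
```

Here `open_questions` lists the fields that were left empty, which gives a simple way to spot accountability questions that have no documented answer yet.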
Some earlier works, published before GDPR came into existence, show how to extract technical requirements from legal requirements [3]. Nevertheless, no design patterns or best practices can be directly applied in the Big Data context for GDPR-compliance implementation.
In the last few years, GDPR has been discussed across a range of academic publications and industry papers from different theoretical and practical perspectives, including numerous implementations and design concepts for GDPR compliance [4]. These works are still in their infancy and have a limited scope. Fully developed and approved tools that implement GDPR articles are still missing, especially in the area of Big Data analytics. The term “Big Data analytics” refers to the entire data management life cycle, from ingestion and storage to the analysis of high volumes of data in heterogeneous formats from different sources. As presented in Fig. 1, the reference architecture of Big Data systems covers 5 main layers [5]: data sources, ingestion, processing, storage, and distribution and services. At the processing layer, sophisticated algorithms are being developed to analyze large amounts of data and gain valuable insights for accurate decision-making, for example by finding meaningful patterns, presuming situations, and predicting and inferring behaviors. Due to the large data volume and the complexity of processing, tracking data dependencies and verifying privacy are challenging. For this purpose, the data security and governance layer is a cross-cutting layer generally used for data security and management. Consequently, it is a key part of the system for implementing GDPR requirements.
Recent academic and industrial tools [6], [7] implement some GDPR requirements by automatically translating privacy policies into software in order to provide accountability. However, these works only partially address GDPR principles and their related articles, such as purpose limitation, data minimization, storage limitation, transparency or security [8], [9]. Other works concentrate on particular articles of the regulation [10], like the right to data portability, the right to be forgotten, the right of access or the right to be informed [11]. Also, these works generally address one particular type of data source (logs, IoT sensors or classical SQL databases). It is not clear how to apply the proposed solutions, which assume uniform data, to Big Data architectures with multi-channel data sources, different purposes and intensive processing. Consequently, we still lack guidelines to verify GDPR compliance and to implement the regulation in a Big Data context. As a starting point for addressing this issue, a comprehensive overview of the regulation and a common understanding of its key concepts are necessary. Next, the analysis of GDPR documentation and the study of recent works on privacy and GDPR allow us to identify the main privacy requirements and building blocks for GDPR compliance verification. As an outcome of this study, and based on the experiments reported in the state of the art, we propose a framework with well-defined components to implement the regulation. We then position the different works on GDPR in the Big Data domain with respect to these components. Furthermore, we provide an overview of how to use the framework to assist IT developers and Big Data system designers in building GDPR-compliant systems and applications.
To illustrate the framework usage, we consider the example of an e-health application and show how we used the framework to support privacy-by-design implementation in that application. This paper’s contributions can be summarized as follows:
- An analysis of GDPR principles and entities for a better understanding of the regulation by IT developers and Big Data system designers.
- A translation from GDPR principles requirements to IT design requirements.
- A framework for GDPR compliance verification and implementation in Big Data systems.
- A classification of the state of the art conducted on GDPR solutions implemented between 2016 and 2021 in both academic and industrial areas.
- A use case demonstrating the framework usage.
This work is an extended version of our previous work [12], which was restricted to a survey and a first version of the proposed framework. In this paper, we propose a translation from the regulation’s requirements to IT design requirements, which allows us to obtain a more precise and fine-grained framework. Furthermore, an IoT use case is proposed to illustrate the framework usage, which helps us identify missing parts in the use-case management system. We also extended the related works section and the GDPR tools section with recent solutions, mainly from industry. The up-to-date version of the studied solutions allows us to provide some key guidelines for GDPR implementation. Finally, the evaluation of the improved solution shows an acceptable overhead when implementing GDPR compliance.
This paper is structured as follows. Section 2 gives an overview of GDPR principles and main entities. In Section 3, we present the related works and highlight the contribution of this paper. Section 4 presents the problem statement and illustrates the main GDPR challenges in Big Data systems. In Section 5, we extract the main IT design requirements starting from GDPR principles and describe our framework for GDPR compliance in Big Data systems. In Section 6, we use the presented framework to classify GDPR tools for reuse purposes. In Section 7, we describe how the framework is used for GDPR-compliance implementation and evaluation. Finally, Section 8 provides a summary of the main findings of this paper and highlights new opportunities for future work.
Section snippets
GDPR entities and principles
The GDPR aims at delivering harmonized, consistent and high-level data protection across Europe. It has 99 articles and 173 recitals grouped into 11 chapters. In those chapters, it addresses a set of principles, entities, obligations and legal requirements. GDPR is a complex law that is hard for Big Data system designers and IT developers to understand and analyze. In this section, we give a big picture of GDPR requirements and entities through a top-down approach.
Related works
Works on GDPR compliance can be divided into 3 main categories: (1) GDPR analysis, (2) Frameworks for GDPR compliance and (3) IT tools for GDPR implementation.
The first category of works presents theoretical interpretations of GDPR. The second category is more technical and presents some guidelines for implementing GDPR-compliant systems. The last category is covered in Section 6, which deals with recent implementations and tools for GDPR in Big Data. In this section, we focus on the first two categories. The
Problem statement
In this section, we discuss the major GDPR challenges in Big Data systems [40]. From our perspective, the current GDPR and privacy challenges can be grouped into four categories following the architecture layers presented in Fig. 1 as follows:
- Challenges in the Data Sources layer: Regarding the privacy principles, both the consent and purpose limitation principles must be considered before beginning the collection phase. Each data subject has the right to know the reasons behind collecting each data from
From GDPR principles to IT GDPR framework
To design GDPR-compliant systems, GDPR obligations have to be interpreted as technical requirements, which is not straightforward. We need a valid means of writing simple and understandable requirements. Some academics started to address this step as soon as GDPR appeared, such as in [42] and [43]. Studying these similar efforts helped us identify and confirm the right IT requirements. This is the scope of this section, in which we follow the steps presented in Fig. 2.
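To give a feel for what interpreting an obligation as a technical requirement can look like, the following minimal sketch expresses three GDPR principles (purpose limitation, data minimization and storage limitation) as executable checks. The `Policy` class, the `violations` function and all field names are assumptions made for this illustration; they are not the requirements derived in this section nor the API of any tool discussed in the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, List


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: each attribute encodes one principle as a constraint."""
    allowed_purposes: FrozenSet[str]  # purpose limitation
    allowed_fields: FrozenSet[str]    # data minimization
    max_retention: timedelta          # storage limitation


def violations(record: dict, purpose: str, collected_at: datetime,
               now: datetime, policy: Policy) -> List[str]:
    """Return the principles that a processing request would violate."""
    found = []
    if purpose not in policy.allowed_purposes:
        found.append("purpose limitation")
    if set(record) - policy.allowed_fields:
        # The record carries fields the policy does not allow for processing.
        found.append("data minimization")
    if now - collected_at > policy.max_retention:
        found.append("storage limitation")
    return found


policy = Policy(
    allowed_purposes=frozenset({"care"}),
    allowed_fields=frozenset({"patient_id", "heart_rate"}),
    max_retention=timedelta(days=30),
)
compliant = violations({"patient_id": 1, "heart_rate": 70}, "care",
                       datetime(2021, 1, 1), datetime(2021, 1, 10), policy)
non_compliant = violations({"patient_id": 1, "address": "unknown"}, "marketing",
                           datetime(2021, 1, 1), datetime(2021, 3, 1), policy)
```

Keeping each principle as a separate, named check makes it easy to trace a rejected request back to the principle it would breach, which is in the spirit of writing simple and understandable requirements.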
IT tools for GDPR implementation
GDPR-oriented tools are divided into 3 main categories: (1) academic GDPR tools, (2) industrial GDPR tools and (3) Apache tools that are built into Big Data solutions. The next sections summarize these three categories.
The framework implementation and application
In this section, we propose an implementation of the framework based on the tools studied in Section 6. From Table 3, we selected the best candidate technology for each component in the context of our use case. Then, we present the application of the framework to improve a previous work on GDPR compliance in e-health systems. We describe the framework usage and evaluate its overhead on the application performance.
Our implementation is based on Apache Ranger [48] and Atlas [70].
Conclusion and future work
This work aims at helping IT designers and developers understand GDPR and implement GDPR-compliant Big Data systems. For this, we analyze GDPR requirements and translate them to IT design requirements. Then, a framework is proposed that details the main components for GDPR compliance verification and implementation. To implement this framework, we classified and compared different tools related to GDPR implementation in Big Data systems. This comparison is guided by the identified IT
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (79)
- et al. The policy effect of the general data protection regulation (GDPR) on the digital public health sector in the European Union: An empirical investigation. Int J Environ Res Public Health (2019)
- General data protection regulation (2020)
- The applicability of the GDPR to the Internet of Things. J Data Prot Priv (2019)
- et al. Information technology artifacts in the regulatory compliance of business processes: a meta-analysis
- et al. Privacy by design in big data: an overview of privacy enhancing technologies in the era of big data analytics (2015)
- et al. Big data application architecture
- et al. Big data and analytics in the age of the GDPR
- Hashicorp vault (2020)
- et al. Transparency, privacy and trust–Technology for tracking and controlling my data disclosures: Does this work?
- et al. ADvoCATE: a consent management platform for personal data processing in the IoT using blockchain technology
- PrivacyTracker: a privacy-by-design GDPR-compliant framework with verifiable data traceability controls
- Building accountability into the Internet of Things: the IoT Databox model. J Reliab Intell Environ
- Compliance without complexity
- Requirements for legally compliant software based on the GDPR
- Data protection by design and default
- Improvement of the applicability of the general data protection regulation in health clinics
- GDPR impacts and opportunities for computer-aided diagnosis Guidelines and legal perspectives
- Options to improve the general model of security management in private bank with GDPR compliance
- An approach to GDPR based on object role modeling
- Methods and tools for GDPR compliance through privacy and data protection engineering
- Is privacy by construction possible?
- Key tension points and design guidelines for GDPR compliance: Designing for a news service application
- An exploration of data interoperability for GDPR. Int J Stand Res
- A framework for GDPR compliance for small-and medium-sized enterprises. European Journal for Security Research
- Smart city IoT platform respecting GDPR privacy and security aspects. IEEE Access
- The business process re-engineering and functional toolkit for GDPR compliance project
- The Defend project
- The Smooth platform
- The pdp4e project
- The papaya project
- The PoSeID-on project
- Protecting citizens’ personal data and privacy: Joint effort from GDPR EU cluster research projects. SN Comput Sci
- Privacy, security, legal and technology acceptance requirements for a GDPR compliance platform
- DEFeND DSM: a data scope management service for model-based privacy by design GDPR compliance
☆ This project is carried out under the MOBIDOC scheme, funded by the EU through the EMORI program and managed by the ANPR. This work was also partially supported by the VRR Tunisien project, funded by the Tunisian Ministry of Higher Education and Scientific Research.