Guidelines for GDPR compliance in Big Data systems

https://doi.org/10.1016/j.jisa.2021.102896

Abstract

Implementing the GDPR, which aims at protecting European citizens’ privacy, remains a real challenge. In particular, in Big Data systems, where data are voluminous and heterogeneous, it is hard to track how data evolve through a complex life cycle that ranges from collection and ingestion to storage and analytics. In this context, research was conducted from 2016 to 2021 and several security tools were designed. However, these tools are either specific to particular applications or address the regulation’s articles only partially. To identify the parts that are covered, the parts that are missed and the metrics needed to compare different works, we propose a framework for GDPR compliance. The framework identifies the main components for implementing the regulation by mapping requirements aligned with the GDPR’s provisions to IT design requirements. Based on this framework, we compare the main GDPR solutions in the Big Data domain and propose a guideline for GDPR verification and implementation in Big Data systems.

Introduction

The General Data Protection Regulation (GDPR) [1] sets new requirements on security and data protection through 99 articles and 173 recitals and aims to protect the rights and freedoms of natural persons. Every organization that deals with personal data has to comply with the GDPR to protect these rights and to be accountable while improving its business models [2]. Accountability means demonstrating how controllers comply with data protection principles. Each organization must answer the following questions: What information is processed, and why? How and where is the data stored? Who can access it, and why? Is it up to date and accurate? How long will it be kept? How will it be safeguarded, and how will accountability be achieved?
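The accountability questions above can be read as the fields of a record kept for each processing activity. The following Python sketch is only an illustration under that assumption: the `ProcessingRecord` class, its field names and the sample values are hypothetical and do not come from the regulation or from the framework described in this paper.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProcessingRecord:
    """Hypothetical record with one field per accountability question."""
    data_categories: Optional[List[str]] = None   # what information is processed?
    purpose: Optional[str] = None                 # why is it processed?
    storage_location: Optional[str] = None        # how and where is it stored?
    authorized_roles: Optional[List[str]] = None  # who can access it?
    last_verified: Optional[str] = None           # is it up to date and accurate?
    retention_days: Optional[int] = None          # how long will it be kept?
    safeguards: Optional[List[str]] = None        # how is it safeguarded?


def unanswered_questions(record: ProcessingRecord) -> List[str]:
    """Return the names of the fields that are missing or empty."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or value == "" or value == []:
            missing.append(f.name)
    return missing


# Sample values are invented for illustration only.
record = ProcessingRecord(
    data_categories=["heart_rate"],
    purpose="remote patient monitoring",
    storage_location="eu-cluster/hdfs",
    authorized_roles=["physician"],
)
print(unanswered_questions(record))
```

A record that still returns a non-empty list has open accountability questions, which is one simple way to make such gaps visible before any data is processed.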

Some earlier works, predating the GDPR, show how to extract technical requirements from legal requirements [3]. Nevertheless, no design patterns or best practices can be directly applied to GDPR-compliance implementation in the Big Data context.

In the last few years, the GDPR has been discussed across a range of academic publications and industry papers from different theoretical and practical perspectives, including numerous implementations and design concepts for GDPR compliance [4]. These works are still in their infancy and have a limited scope. Fully developed and approved tools that implement GDPR articles are still missing, especially in the area of Big Data analytics. The term “Big Data analytics” refers to the entire data management life cycle, from ingestion and storage to the analysis of high volumes of data in heterogeneous formats from different sources. As presented in Fig. 1, the reference architecture of Big Data systems covers five main layers [5]: data sources, ingestion, processing, storage, and distribution and services. At the processing layer, sophisticated algorithms are developed to analyze large amounts of data and to gain valuable insights for accurate decision-making, detecting unprecedented opportunities such as finding meaningful patterns, presuming situations, and predicting and inferring behaviors. Due to the large data volume and the complexity of processing, tracking data dependencies and verifying privacy are challenging. For this purpose, a data security and governance layer cuts across the other layers and is generally used for data security and management. Consequently, it is a key part of the system for implementing GDPR requirements.
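To make the role of a cross-cutting governance layer more concrete, the sketch below shows one possible way to record which layer touched a data item and for which purpose. It is a simplified illustration, not the architecture of Fig. 1: the `TrackedItem` and `GovernanceLayer` classes, the purpose names and the lineage format are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Layer names follow the reference architecture described in the text.
LAYERS = [
    "data sources",
    "ingestion",
    "processing",
    "storage",
    "distribution and services",
]


@dataclass
class TrackedItem:
    """Hypothetical data item carrying its allowed purposes and its lineage."""
    item_id: str
    allowed_purposes: Set[str]
    lineage: List[str] = field(default_factory=list)


class GovernanceLayer:
    """Hypothetical cross-layer component consulted before an item is used."""

    def authorize(self, item: TrackedItem, layer: str, purpose: str) -> bool:
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        allowed = purpose in item.allowed_purposes
        # Every request is logged, allowed or not, so the trail stays complete.
        outcome = "ok" if allowed else "denied"
        item.lineage.append(f"{layer}:{purpose}:{outcome}")
        return allowed


governance = GovernanceLayer()
item = TrackedItem("record-1", allowed_purposes={"diagnosis"})
governance.authorize(item, "ingestion", "diagnosis")
governance.authorize(item, "processing", "marketing")
print(item.lineage)
```

Logging denied requests as well as granted ones is a design choice of this sketch: it keeps a trace that can later support accountability, at the cost of a larger log.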

Recent academic and industrial tools [6], [7] implement some GDPR requirements by automatically translating privacy policies into software in order to provide accountability. However, these works address only partially the GDPR principles and their related articles, such as purpose limitation, data minimization, storage limitation, transparency or security [8], [9]. Other works concentrate on particular articles of the regulation [10], such as the right to data portability, the right to be forgotten, the right of access or the right to be informed [11]. These works also generally address one particular type of data source (logs, IoT sensors or classical SQL databases). It is not clear how solutions that assume uniform data can be applied to Big Data architectures with multi-channel data sources, different purposes and intensive processing. Consequently, we still lack guidelines to verify GDPR compliance and to implement the regulation in a Big Data context. As a starting point for addressing this issue, a comprehensive overview of the regulation and a common understanding of its key concepts are necessary. The analysis of GDPR documentation and the study of recent works on privacy and the GDPR then allow us to identify the main privacy requirements and building blocks for GDPR compliance verification. As an outcome of this study, and based on the experiments reported in the state of the art, we propose a framework with well-defined components to implement the regulation. We use these components to position the different works carried out on the GDPR in the Big Data domain. Furthermore, we provide an overview of how to use the framework to assist IT developers and Big Data system designers in building GDPR-compliant systems and applications.
To illustrate the framework usage, we consider the example of an e-health application and show how we used the framework to support privacy-by-design implementation in that application. This paper’s contributions can be summarized as follows:

  • An analysis of GDPR principles and entities for a better understanding of the regulation by IT developers and Big Data system designers.

  • A translation from GDPR principles requirements to IT design requirements.

  • A framework for GDPR compliance verification and implementation in Big Data systems.

  • A classification of state-of-the-art GDPR solutions implemented between 2016 and 2021 in both academia and industry.

  • A use case demonstrating the framework usage.

This work is an extended version of our previous work [12], which was restricted to a survey and a first version of the proposed framework. In this paper, we propose a translation from the regulation’s requirements to IT design requirements, which gives us a more precise and fine-grained framework. In addition, an IoT use case is proposed to illustrate the framework usage, which helps us identify missing parts in the use-case management system. We also extended the related works section and the GDPR tools section with recent solutions, mainly from industry. The up-to-date version of the studied solutions allows us to provide some key guidelines for GDPR implementation. Finally, the evaluation of the improved solution shows an acceptable overhead when implementing GDPR compliance.

This paper is structured as follows. Section 2 gives an overview of GDPR principles and main entities. In Section 3, we present the related works and highlight the contribution of this paper. Section 4 presents the problem statement and illustrates the main GDPR challenges in Big Data systems. In Section 5, we extract the main IT design requirements starting from GDPR principles and describe our framework for GDPR compliance in Big Data systems. In Section 6, we use the presented framework to classify GDPR tools for reuse purposes. We describe the use of the framework for GDPR-compliance implementation and evaluation in Section 7. Finally, Section 8 summarizes the main findings of this paper and highlights opportunities for future work.

Section snippets

GDPR entities and principles

The GDPR aims at delivering harmonized, consistent and high-level data protection across Europe. It has 99 articles and 173 recitals grouped into 11 chapters. In those chapters, it addresses a set of principles, entities, obligations and legal requirements. The GDPR is a complex law that is hard for Big Data system designers and IT developers to understand and analyze. In this section, we give an overview of GDPR requirements and entities using a top-down approach.

Related works

Works on GDPR compliance can be divided into 3 main categories: (1) GDPR analysis, (2) Frameworks for GDPR compliance and (3) IT tools for GDPR implementation.

The first category of works presents theoretical interpretations of the GDPR. The second category is more technical and presents guidelines for implementing GDPR-compliant systems. The last category is covered in Section 6, which deals with recent GDPR implementations and tools in Big Data. In this section, we focus on the first two categories. The

Problem statement

In this section, we discuss the major GDPR challenges in Big Data systems [40]. From our perspective, the current GDPR and privacy challenges can be grouped into four categories following the architecture layers presented in Fig. 1 as follows:

  • Challenges in the Data Sources layer: Regarding the privacy principles, both the consent and the purpose limitation principles must be considered before beginning the collection phase. Each data subject has the right to know the reasons behind collecting each data from

From GDPR principles to IT GDPR framework

To design GDPR-compliant systems, GDPR obligations have to be interpreted as technical requirements, which is not straightforward. We need a sound means of writing simple and understandable requirements. Some academics started to address this step as soon as the GDPR appeared, such as in [42] and [43]. Studying these similar efforts helped us identify and confirm the right IT requirements. This is the scope of this section, where we follow the steps presented in Fig. 2.
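As a rough illustration of what such a translation can look like in practice, the sketch below maps a few GDPR principles named in the introduction to example technical requirements and reports which ones a given tool leaves uncovered. The requirement names in `PRINCIPLE_TO_REQUIREMENTS` are hypothetical placeholders chosen for this example; they are not the requirements derived in this paper, and the steps of Fig. 2 are not reproduced here.

```python
from typing import Dict, List, Set

# Hypothetical mapping: the principles are named in the text,
# the requirement identifiers are assumptions made for this sketch.
PRINCIPLE_TO_REQUIREMENTS: Dict[str, List[str]] = {
    "purpose limitation": ["purpose_tagging", "purpose_based_access_control"],
    "data minimization": ["field_level_filtering"],
    "storage limitation": ["retention_policy_enforcement"],
    "transparency": ["processing_audit_log"],
    "security": ["encryption_at_rest", "access_control"],
}


def uncovered_requirements(implemented: Set[str]) -> Dict[str, List[str]]:
    """Return, per principle, the requirements that are not yet implemented."""
    gaps: Dict[str, List[str]] = {}
    for principle, requirements in PRINCIPLE_TO_REQUIREMENTS.items():
        missing = [r for r in requirements if r not in implemented]
        if missing:
            gaps[principle] = missing
    return gaps


# Features of an imaginary tool, used only to exercise the function.
tool_features = {"access_control", "processing_audit_log", "purpose_tagging"}
for principle, missing in uncovered_requirements(tool_features).items():
    print(f"{principle}: missing {', '.join(missing)}")
```

Keeping the mapping as plain data separates the legal interpretation, which needs expert review, from the coverage check, which can be automated and rerun as tools evolve.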

IT tools for GDPR implementation

GDPR-oriented tools are divided into 3 main categories: (1) academic GDPR tools, (2) industrial GDPR tools and (3) Apache tools built into Big Data solutions. The next sections summarize these three categories.

The framework implementation and application

In this section, we propose an implementation of the framework based on the tools studied in Section 6. From Table 3, we selected the best candidate technology for each component in the context of our use case. Then, we apply the framework to improve a previous work on GDPR compliance in e-health systems. We describe the framework usage and evaluate its overhead on the application performance.

Our implementation is based on Apache Ranger [48] and Atlas [70].

Conclusion and future work

This work aims at helping IT designers and developers understand GDPR and implement GDPR-compliant Big Data systems. For this, we analyze GDPR requirements and translate them to IT design requirements. Then, a framework is proposed that details the main components for GDPR compliance verification and implementation. To implement this framework, we classified and compared different tools related to GDPR implementation in Big Data systems. This comparison is guided by the identified IT

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (79)

  • Yuan B. et al. The policy effect of the general data protection regulation (GDPR) on the digital public health sector in the European union: An empirical investigation. Int J Environ Res Public Health (2019)
  • EU G. General data protection regulation (2020)
  • Pham P.l. The applicability of the GDPR to the Internet of Things. J Data Prot Priv (2019)
  • Akhigbe O. et al. Information technology artifacts in the regulatory compliance of business processes: a meta-analysis
  • D’Acquisto G. et al. Privacy by design in big data: an overview of privacy enhancing technologies in the era of big data analytics (2015)
  • Sawant N. et al. Big data application architecture
  • Bonatti P.A. et al. Big data and analytics in the age of the GDPR
  • Hashicorp P.A. Hashicorp vault (2020)
  • Fischer-Hübner S. et al. Transparency, privacy and trust–Technology for tracking and controlling my data disclosures: Does this work?
  • Rantos K. et al. ADvoCATE: a consent management platform for personal data processing in the IoT using blockchain technology
  • Gjermundrød H. et al. PrivacyTracker: a privacy-by-design GDPR-compliant framework with verifiable data traceability controls
  • Crabtree A. et al. Building accountability into the Internet of Things: the IoT Databox model. J Reliab Intell Environ (2018)
  • Rhahla M, Allegue S, Abdellatif T. A framework for GDPR compliance in Big Data systems. In: 14th international...
  • GDPR for Cloudera A. Compliance without complexity (2021)
  • Ringmann S.D. et al. Requirements for legally compliant software based on the GDPR
  • UK I. Data protection by design and default (2020)
  • Lopes I.M. et al. Improvement of the applicability of the general data protection regulation in health clinics
  • Pedrosa M. et al. GDPR impacts and opportunities for computer-aided diagnosis Guidelines and legal perspectives
  • Schulz K. et al. Options to improve the general model of security management in private bank with GDPR compliance
  • Shah A, Banakar V, Shastri S, Wasserman M, Chidambaram V. Analyzing the impact of GDPR on storage systems. In: 11th...
  • Gonçalves A. et al. An approach to GDPR based on object role modeling
  • Krempel E, Beyerer J. The EU general data protection regulation and its effects on designing assistive environments....
  • Martin Y.S. et al. Methods and tools for GDPR compliance through privacy and data protection engineering
  • Schneider G. Is privacy by construction possible?
  • Burt Kut M. Key tension points and design guidelines for GDPR compliance: Designing for a news service application (2018)
  • Pandit H.J. et al. An exploration of data interoperability for GDPR. Int J Stand Res (2018)
  • Peras D. Guidelines for GDPR compliant consent and data management model in ICT businesses. In: 29th international...
  • Tesfay WB, Hofmann P, Nakamura T, Kiyomoto S, Serna J. Privacyguide: Towards an implementation of the eu gdpr on...
  • Brodin M. A framework for GDPR compliance for small-and medium-sized enterprises. European Journal for Security Research (2019)
  • Badii C. et al. Smart city IoT platform respecting GDPR privacy and security aspects. IEEE Access (2020)
  • bpr4gdpr C. The business process re-engineering and functional toolkit for gdpr compliance project (2021)
  • DEFeND C. The Defend project (2021)
  • SMOOTH C. The Smooth platform (2021)
  • PDP4E C. The pdp4e project (2021)
  • PAPAYA C. The papaya project (2021)
  • PoSeID-on C. The PoSeID-on project (2021)
  • de Carvalho R.M. et al. Protecting citizens’ personal data and privacy: Joint effort from GDPR EU cluster research projects. SN Comput Sci (2020)
  • Tsohou A. et al. Privacy, security, legal and technology acceptance requirements for a GDPR compliance platform
  • Piras L. et al. DEFeND DSM: a data scope management service for model-based privacy by design GDPR compliance


This project is carried out under the MOBIDOC scheme, funded by the EU through the EMORI program and managed by the ANPR. This work was also partially supported by the VRR Tunisien project, funded by the Tunisian Ministry of Higher Education and Scientific Research.
