Big genetic data and its big data protection challenges

doi:10.1016/j.clsr.2018.05.028

Computer Law & Security Review

Volume 34, Issue 5, October 2018, Pages 1000-1018

https://doi.org/10.1016/j.clsr.2018.05.028

Abstract

The use of various forms of big data has revolutionised scientific research. This includes research in the field of genetics, in areas ranging from medical research to anthropology. Developments in this area have inter alia been characterised by the ability to produce genome-wide sequences (GWS) cheaply, the ability to share and combine these with other forms of complementary data, and ever more powerful processing techniques made possible by tremendous increases in computing power. Given that many, if not most, of these techniques will make use of personal data, it is necessary to take data protection law into account. This article looks at the challenges that the EU's General Data Protection Regulation, which will be in effect from May 2018, will present for researchers. The very nature of research with big data in general, and genetic data in particular, means that in many instances compliance will be onerous, whilst in others it may be difficult even to envisage how compliance is possible. Compliance concerns include issues relating to ‘purpose limitation’, ‘data minimisation’ and ‘storage limitation’. Other requirements, including the need to facilitate data subject rights and potentially to conduct a Data Protection Impact Assessment (DPIA), may provide further complications for researchers. Further critical issues to consider include the choice of legal base: whether to opt for what is often seen as the ‘default option’ (i.e. consent) or to process under the so-called ‘scientific research exception’. Each presents its own challenges (including the likely need to gain ethical approval) and opportunities that will have to be considered according to the particular context in question.

Introduction

The use of genetic data in research has been undergoing a fundamental shift. Researchers are no longer restricted to working with relatively small samples of individual genomes (for example DNA relating to a gene known to affect disease aetiology) but now work with various markers scattered across the entire genome. This type of data is used in various areas of research, including efforts to discover new disease variants or to increase understanding of evolutionary processes. The fields of bioinformatics and computational genetics have evolved inter alia to allow researchers to focus on detailed ‘high-depth’ sequencing of the entire genome of individuals, made possible by advances in genome sequencing technology and computing power. These advances mean that an individual's genome can be sequenced relatively quickly and cheaply (costing less than an MRI scan in a local hospital). Powerful software has furthermore been developed to analyse such genome-wide sequences (GWSs). The research potential of such techniques has been complemented by the ability to share and combine GWS data with a range of potentially complementary data sets (e.g. electronic health records). These developments have ushered in a world of ‘big data genomics’ where researchers carry out complex data mining operations on the entire genomes of individuals and groups of individuals.

Whilst these developments promise to permit great leaps forward in our understanding of the human genome and its relationship to various important issues (not least to human disease), they also pose new risks in terms of privacy-related harms. These include harms not only to the individuals providing the genetic samples in question but even to those who may be related to them.¹ Complying with laws relating to privacy, and in particular to data protection, will therefore be a serious issue for researchers conducting research on large samples of genetic data. This article aims to illustrate a number of these issues, highlighting some of the major challenges that the data protection framework poses for researchers active in the use of big genetic data.² It will focus on compliance with the EU’s new General Data Protection Regulation (GDPR), which comes into effect across the EU from May 2018. In doing so, this paper will use several prominent examples from documented research practice in the area of computational genetics. The authors will illustrate how common practices in this area may be difficult to reconcile with the key pillars of data protection, including the need to have a valid legal ground for processing personal data, the need to respect data processing principles and the need to facilitate data protection rights. As this paper suggests, such burdens may mean that compliance with the EU’s data protection regime (including under the new GDPR) may not only be cumbersome but may, in many cases, be difficult even to envisage given the aims of big genetic data processing for research.

Section 2 of this paper will briefly introduce the concept of ‘big genetic data’ and discuss how researchers can use it. Sections 3 and 4 will look at how, given the nature of modern computational genetics, genetic data used in research is likely not only to be personal in nature (i.e. rarely anonymous) but also to be categorised as ‘sensitive’ or ‘special’ data. Section 5 will look at how the need to respect data processing principles will present difficulties for researchers involved in computational genetics. Section 6 will look at the issue of data protection impact assessments, something that will be obligatory (and potentially onerous) for many forms of research given the sensitive (or special) nature of genetic data. Section 7 will analyse how the need to facilitate data subject rights may create major obstacles for researchers involved in the use of big genetic data. The issues surrounding the use of both consent and the scientific research exception as a legal base for processing will be discussed in Sections 8 and 9 respectively. The requirements of each may mean that on many occasions the latter is more suitable, though, as Section 9 discusses, researchers (including those in areas of computational genetics) may have difficulty convincing ethics committees of this, presenting further problems for research in this area.

Section snippets

Big genetic data and its use in research

Genetic data originates from human tissue or other biological samples. These range from blood, saliva and urine samples taken from individuals to tissues taken from cadavers in ancient DNA studies to soil, water and rock samples in environmental DNA studies.³

It is becoming easier to link genetic data to specific individuals

Personal data is data that can likely be linked to an identifiable individual. Data that cannot be linked to an individual is not personal data and is not governed by the EU data protection framework.⁹ Consequently, those involved in processing such data will not have to comply with its requirements. Where possible, researchers have in the past tended to claim that genetic data was not personal data in order to avoid the need for compliance with data protection regulations. This

Personal genetic data is always sensitive data

Personal data that is sensitive in nature attracts a higher regulatory burden than non-sensitive data. The legal situation concerning genetic data is in a state of flux. This is because the GDPR explicitly describes genetic data as ‘special’ (i.e. sensitive) data.³² This was not the case with Directive 95/46/EC, which did not define what genetic data was or what legal value it had. The Article 29 Working Party opinion on genetic data³³

Data processing principles cannot be consented away

The data protection principles contained within the data protection framework are of crucial importance given that, in general, they must be adhered to in all cases of processing of personal data.⁴⁵ It is not possible, for example, for individuals to consent away the need to adhere to the data protection principles. Requirements such as accuracy, purpose

The need for an impact assessment

One of the novel requirements of the GDPR is the need to perform a ‘Data Protection Impact Assessment’ (DPIA) in a number of circumstances where the proposed processing may “represent a high risk to the rights and freedoms of natural persons”.⁵⁹ The GDPR does not exhaustively describe all the situations where a data protection impact assessment is required but does describe certain occasions where it shall be required, including situations that require “processing on a large

The need to facilitate data subject rights

Data subject rights allow data subjects to ensure that their data is being processed both fairly and lawfully and, in a number of situations, to exercise a level of autonomy over the processing of their personal data.⁶⁵

Researchers have a choice of legal base

A sine qua non for the processing of personal data is the existence of a legal basis for processing given its context and purpose. As with its predecessor, the GDPR sets out an (expanded) number of potential legal bases that can be used to justify the processing of personal data.⁸²

An alternative to consent as legal basis

In addition to ‘explicit consent’, another potentially relevant legal base is where such processing may be in the “public interest”.¹⁰¹ This provision has thus far been used by Member States in their transposition of Directive 95/46/EC (and in other legislation) to permit processing of sensitive data for a range of purposes, including for scientific research.¹⁰²

The critical role of ethics bodies

Despite the clear existence of a legal ground for the processing of sensitive data for research purposes that does not require consent, regulatory authorities and ethics bodies have, in many cases, been reluctant to use this option, preferring to insist that researchers obtain consent or use anonymised data.¹¹⁹

Conclusion

Computational genetics is undergoing a revolution. A number of developments have fuelled this revolution. Chief amongst these is the increasing ability to produce GWSs rapidly and at low cost. These can be mined repeatedly because of increases in computing power. The possibility of accessing and sharing various forms of potentially compatible information throughout the online-connected world has not only allowed for more research opportunities but also changed the way we view genetic data in

References (55)

F. Aldhouse
Anonymisation of personal data - A missed opportunity for the European Commission
Comput Law Secur Rev
(2014)
M. Dawn Teare et al.
Genetic linkage studies
Lancet North Am Ed
(2005)
D. Hallinan et al.
Genetic data and the data protection regulation: anonymity, multiple subjects, sensitivity and a prohibitionary logic regarding genetic data?
Comput Law Secur Rev
(2013)
A.L. McGuire et al.
DNA data sharing: research participants' perspectives
Genet Med
(2008)
E. Niemiec et al.
Ethical issues in consumer genome sequencing: use of consumers' samples and data
Appl Transl Genet
(2016)
M.M. Al Aziz et al.
Secure and efficient multiparty computation on genomic data
Proceedings of the 20th International Database Engineering & Applications Symposium
(2016)
V. Amin et al.
Does more schooling improve health outcomes and health related behaviors? Evidence from U.K. twins
Econ Educ Rev
(2013)
L. Andrews
Social, legal, and ethical implications of genetic testing
(1994)
A. Auton et al. (The 1000 Genomes Project Consortium)
A global reference for human genetic variation
Nature
(2015)
P. Boddington et al.
Consent forms in genomics: the difference between law and practice
Eur J Health Law
(2011)
J. Bohannon
Genealogy databases enable naming of anonymous DNA donors
Science
(2013)
J. Butler
The future of forensic DNA analysis
Phil Trans R Soc
(2015)
R. Cai et al.
Deterministic identification of specific individuals from GWAS results
Bioinformatics
(2015)
P. Carter et al.
The social licence for research: why care data ran into trouble
J Med Ethics
(2015)
G. Chassang
The impact of the EU general data protection regulation on scientific research
Ecancermedicalscience
(2017)
E. Clayton et al.
Frontotemporal dementia caused by CHMP2B mutation is characterised by neuronal lysosomal storage pathology
Acta Neuropathol (Berl)
(2015)
The UK10K Consortium
The UK10K project identifies rare variants in health and disease
Nature
(2015)
P. De Hert et al.
Privacy, data protection and law enforcement. Opacity of the individual and transparency of the power
K. Deribe et al.
Mapping the geographical distribution of podoconiosis in Cameroon using parasitological, serological, and clinical evidence to exclude other causes of lymphedema
PLoS Negl Trop Dis
(2018)
L. Dubois et al.
Genetic and environmental contributions to weight, height, and BMI from birth to 19 years of age: an international study of over 12,000 twin pairs
PLoS One
(2012)
R. Fears et al.
Data protection regulation and the promotion of health research: getting the balance right
Q J Med
(2014)
M. Friedewald et al.
Open consent, biobanking and data protection law: can open consent be ‘informed’ under the forthcoming data protection regulation?
Life Sci Soc Policy
(2015)
N. Ghani et al.
Big data and data protection - issues with purpose limitation principle
Int J Adv Soft Comput Appl
(2016)
S. Gutwirth et al.
European data protection: in good health?
(2012)
M. Gymrek et al.
Identifying personal genomes by surname inference
Science
(2013)
K.Y. He et al.
Big data analytics for genomic medicine
Int J Mol Sci
(2017)
E.P. Hong et al.
Sample size and statistical power calculation in genetic association studies
Genom Inform
(2012)
