Big genetic data and its big data protection challenges

https://doi.org/10.1016/j.clsr.2018.05.028

Abstract

The use of various forms of big data has revolutionised scientific research. This includes research in the field of genetics in areas ranging from medical research to anthropology. Developments in this area have inter alia been characterised by the ability to produce genome-wide sequences (GWS) cheaply, the ability to share and combine them with other forms of complementary data, and ever more powerful processing techniques made possible by tremendous increases in computing power. Given that many if not most of these techniques will make use of personal data, it is necessary to take data protection law into account. This article looks at the challenges that the EU's General Data Protection Regulation, which will be in effect from May 2018, will present for researchers. The very nature of research with big data in general and genetic data in particular means that in many instances compliance will be onerous, whilst in others it may even be difficult to envisage how compliance may be possible. Compliance concerns include issues relating to ‘purpose limitation’, ‘data minimisation’ and ‘storage limitation’. Other requirements, including the need to facilitate data subject rights and potentially to conduct a Data Protection Impact Assessment (DPIA), may provide further complications for researchers. Further critical issues to consider include the choice of legal base: whether to opt for what is often seen as the ‘default option’ (i.e. consent) or to process under the so-called ‘scientific research exception’. Each presents its own challenges (including the likely need to gain ethical approval) and opportunities that will have to be considered according to the particular context in question.

Introduction

The use of genetic data in research has been undergoing a fundamental shift. Researchers are no longer restricted to working with relatively small samples of individual genomes (for example DNA relating to a gene known to affect disease aetiology) but now work with various markers scattered across the entire genome. This type of data is used in various areas of research, including efforts to discover new disease variants or to increase understanding of evolutionary processes. The fields of bioinformatics and computational genetics have evolved inter alia to allow researchers to focus on detailed ‘high-depth’ sequencing of the entire genome of individuals, made possible by advances in genome sequencing technology and computing power. These advances mean that an individual's genome can be sequenced relatively quickly and cheaply (costing less than an MRI scan in a local hospital). Powerful software has furthermore been developed to analyse such genome-wide sequences (GWSs). The research potential of such techniques has been complemented by the ability to share and combine GWS data with a range of potentially complementary data sets (e.g. electronic health records). These developments have ushered in a world of ‘big data genomics’ where researchers carry out complex data mining operations on the entire genomes of individuals and groups of individuals.

Whilst these developments promise to permit great leaps forward in our understanding of the human genome and its relationship to various important issues (not least to human disease), they also pose new risks in terms of privacy-related harms. These include harms not only to the individuals providing the genetic samples in question but even to those who may be related to them.1 Complying with laws relating to privacy, and in particular to data protection, will therefore be a serious issue for researchers conducting research on large samples of genetic data. This article aims to illustrate a number of these issues, highlighting some of the major challenges that the data protection framework poses for researchers active in the use of big genetic data.2 It will focus on compliance with the EU’s new General Data Protection Regulation (GDPR), which comes into effect across the EU from May 2018. In doing so, this paper will use several prominent examples from documented research practice in the area of computational genetics. The authors will illustrate how common practices in this area may be difficult to reconcile with the key pillars of data protection, including the need to have a valid legal ground for processing personal data, the need to respect data processing principles and the need to facilitate data protection rights. As this paper suggests, such burdens mean that compliance with the EU’s data protection regime (including under the new General Data Protection Regulation) may not only be cumbersome but may, in many cases, be difficult even to envisage given the aims of big genetic data processing for research.

Section 2 of this paper will briefly introduce the concept of ‘big genetic data’ and discuss how researchers can use it. Sections 3 and 4 will look at how, given the nature of modern computational genetics, genetic data used in research is likely not only to be of a personal nature (i.e. rarely anonymous) but also to be categorised as ‘sensitive’ or ‘special’ data. Section 5 will look at how the need to respect data processing principles will present difficulties for researchers involved in computational genetics. Section 6 will look at the issue of data protection impact assessments, something that will be obligatory (and potentially onerous) for many forms of research given the sensitive (or special) nature of genetic data. Section 7 will analyse how the need to facilitate data subject rights may create major obstacles for researchers involved in the use of big genetic data. The issues surrounding the use of both consent and the scientific research exception as a legal base for processing will be discussed in Sections 8 and 9 respectively. The requirements of each may mean that on many occasions the latter is more suitable, though, as Section 9 discusses, researchers (including those in areas of computational genetics) may have difficulty convincing ethics committees of this, presenting further problems for research in this area.

Section snippets

Big genetic data and its use in research

Genetic data originates from human tissue or other biological samples. These range from blood, saliva and urine samples taken from individuals to tissues taken from cadavers in ancient DNA studies to soil, water and rock samples in environmental DNA studies.3

It is becoming easier to link genetic data to specific individuals

Personal data is data that can likely be linked to an identifiable individual. Data that cannot be linked to an individual is not personal data and is not governed by the EU data protection framework.9 Consequently, those involved in processing such data will not have to comply with its requirements. Where possible, researchers have in the past tended to claim that genetic data was not personal data in order to avoid the need for compliance with data protection regulations. This

Personal genetic data is always sensitive data

Personal data that is sensitive in nature attracts a higher regulatory burden than non-sensitive data. The legal situation concerning genetic data is in a state of flux. This is because the GDPR explicitly describes genetic data as ‘special’ (i.e. sensitive) data.32 This was not the case with Directive 95/46/EC, which did not define what genetic data was or what legal value it had. The Article 29 Working Party opinion on genetic data33

Data processing principles cannot be consented away

The data protection principles contained within the data protection framework are of crucial importance given that, in general, they must be adhered to in all cases of processing of personal data.45 It is not possible for example for individuals to consent away the need to adhere to the data protection principles. Requirements such as accuracy, purpose

The need for an impact assessment

One of the novel requirements of the GDPR is the need to perform a ‘Data Protection Impact Assessment’ (DPIA) in a number of circumstances where the proposed processing may “represent a high risk to the rights and freedoms of natural persons”.59 The GDPR does not exhaustively describe all the situations where a data protection impact assessment is required but does describe certain occasions where it shall be required, including situations that require “processing on a large

The need to facilitate data subject rights

Data subject rights allow data subjects to ensure that their data is being processed both fairly and lawfully and, in a number of situations, to exercise a level of autonomy over the processing of their personal data.65

Researchers have a choice of legal base

A sine qua non for the processing of personal data is the existence of a legal basis for processing given its context and purpose. As with its predecessor, the GDPR sets out an (expanded) number of potential legal bases that can be used to justify the processing of personal data.82

An alternative to consent as legal basis

In addition to ‘explicit consent’, another potentially relevant legal base is where such processing may be in the “public interest”.101 This provision has thus far been used by Member States in their transposition of Directive 95/46/EC (and in other legislation) to permit processing of sensitive data for a range of purposes, including for scientific research.102

The critical role of ethics bodies

Despite the clear existence of a legal ground for the processing of sensitive data for research purposes that does not require consent, regulatory authorities and ethics bodies have, in many cases, been reluctant to use this option, preferring to insist that researchers obtain consent or use anonymised data.119

Conclusion

Computational genetics is undergoing a revolution. A number of developments have fuelled this revolution. Chief amongst these is the increasing ability to produce GWSs rapidly and at low cost. These can be mined repeatedly because of increases in computing power. The ability to access and share various forms of potentially compatible information throughout the online-connected world has not only allowed for more research opportunities but also changed the way we view genetic data in

References (55)

  • J. Bohannon, Genealogy databases enable naming of anonymous DNA donors, Science (2013)
  • J. Butler, The future of forensic DNA analysis, Phil Trans R Soc (2015)
  • R. Cai et al., Deterministic identification of specific individuals from GWAS results, Bioinformatics (2015)
  • P. Carter et al., The social licence for research: why care.data ran into trouble, J Med Ethics (2015)
  • G. Chassang, The impact of the EU general data protection regulation on scientific research, Ecancermedicalscience (2017)
  • E. Clayton et al., Frontotemporal dementia caused by CHMP2B mutation is characterised by neuronal lysosomal storage pathology, Acta Neuropathol (Berl) (2015)
  • The UK10K Consortium, The UK10K project identifies rare variants in health and disease, Nature (2015)
  • P. De Hert et al., Privacy, data protection and law enforcement. Opacity of the individual and transparency of the power
  • K. Deribe et al., Mapping the geographical distribution of podoconiosis in Cameroon using parasitological, serological, and clinical evidence to exclude other causes of lymphedema, PLoS Negl Trop Dis (2018)
  • L. Dubois et al., Genetic and environmental contributions to weight, height, and BMI from birth to 19 years of age: an international study of over 12,000 twin pairs, PLoS One (2012)
  • R. Fears et al., Data protection regulation and the promotion of health research: getting the balance right, Q J Med (2014)
  • M. Friedewald et al., Open consent, biobanking and data protection law: can open consent be ‘informed’ under the forthcoming data protection regulation?, Life Sci Soc Policy (2015)
  • N. Ghani et al., Big data and data protection - issues with purpose limitation principle, Int J Adv Soft Comput Appl (2016)
  • S. Gutwirth et al., European data protection: in good health? (2012)
  • M. Gymrek et al., Identifying personal genomes by surname inference, Science (2013)
  • K.Y. He et al., Big data analytics for genomic medicine, Int J Mol Sci (2017)
  • E.P. Hong et al., Sample size and statistical power calculation in genetic association studies, Genom Inform (2012)