© Springer Science+Business Media Dordrecht 2015. Deborah Mascalzoni (ed.), Ethics, Law and Governance of Biobanking, The International Library of Ethics, Law and Technology 14, DOI 10.1007/978-94-017-9573-9_8
The Tension Between Data Sharing and the Protection of Privacy in Genomics Research
University of Oxford, Oxford, UK
This chapter has already been published as: Kaye (2012b). The Tension Between Data Sharing and the Protection of Privacy in Genomics Research. Annual Review of Genomics and Human Genetics 13:415–31. We kindly thank the publisher for allowing the reprint.
With the costs of sequencing technology falling rapidly, we are moving to a position where whole-genome scanning of individual DNA samples will start to become routine in medical research and clinical medicine. This is also a critical point in time for the building of infrastructure and the linkage of existing biobanks and bioclinical projects. These plans are starting to be operationalized to enable the sharing of data and samples in a systematic way on a large scale. However, the meta-level governance mechanisms that are needed to support this are still in development. The move to global data sharing has been facilitated by funding bodies on both sides of the Atlantic, which have supported large international collaborative projects and developed open access policies to encourage wide-scale data sharing. In combination, these trends challenge some of the basic principles of protection of research participants and the current governance frameworks for research. One of the key challenges is determining how to protect the privacy of participants while enabling the sharing of data and samples through global research networks. To provide some understanding of the concerns raised by data sharing, this review outlines the issues involved in privacy protection as well as the current trends that have transformed genomics research practice and facilitated data sharing. It describes how data sharing tests current ethical principles and oversight mechanisms for medical research. In conclusion, it discusses ways forward and some of the new initiatives being developed to facilitate data sharing and enable sustainable genomics research.
2 The Nature of Privacy
The protection of individual privacy is enshrined in legal instruments of all liberal democracies and is a benchmark of civil society. Although privacy is not an absolute right, interference must be justified in the public interest and/or according to law. An example of how the courts in the United Kingdom regard privacy is from Lord Justice Laws of the Court of Appeal:
Subject to [certain] qualifications … an individual’s personal autonomy makes him – should make him – master of all those facts about his own identity, such as his name, health, sexuality, ethnicity, his own image … and also of the “zone of interaction” … between himself and others. He is the presumed owner of these aspects of his own self; his control of them can only be loosened, abrogated, if the State shows an objective justification for doing so. (Wood 2009 at §21)
Individual expectations of privacy are context specific, and so can vary depending upon the individual and the circumstances. Within research, the expectations and norms associated with different kinds of research can lead to variation in the practices that apply. Privacy consists of four interrelated dimensions, which come into play in different ways depending upon the context: physical privacy, informational privacy, decisional privacy, and proprietary privacy (Laurie et al. 2010). In the case of genomics research, any or all of these dimensions may be activated depending upon the context. Within genomics research, some of the privacy risks have been identified “as analysis efforts aimed at exposing individual research participants’ information, including revealing disease status, predicted future likelihood or past presence of other traits, or attempts to link another DNA result with a participant, for example, to determine presence or absence in a research cohort, ancestry, and relatedness (e.g., paternity/nonpaternity)” (Johnson et al. 2011). To safeguard against such harms, protections must be established to prevent discrimination against participants and ensure that their medical and personal information is not disclosed to third parties—such as their family or community members, employers, or insurance companies—against their wishes (Gitter 2011). This is because the character of DNA means that sequence information has implications for other biologically related family members, and “the fact that children carry half the genetic information of their parents implies that a decision to reveal one’s genetic information today has repercussions for generations to come” (Johnson et al. 2011). These concerns have led to considerable debate within the genomics community as to how best to protect participants’ and their biological relatives’ privacy while still allowing research to proceed.
3 Trends Within Science
Over the past 10 years, there have been significant changes in the way that genomics research is carried out that have implications for privacy protection. These changes are part of a longer term evolution in scientific practice that has been occurring over a number of decades. Genomics research is now increasingly dependent upon the sharing of data and samples through global collaborative research networks. This widespread data sharing and the building of global research networks are possible only because of technological advances, considerable investment in infrastructure and international consortia, and the implementation of open access policies by funding bodies. Achieving research goals and priorities at an international level would not have occurred at the same scale and speed without the advances in bioinformatics and computing technology, which in turn have led to changes in scientific practice and the way that research is carried out. The relatively recent introduction of next-generation whole-genome sequencing technology adds another layer of complexity to this situation.
4 New Models
The way that genomics research is carried out today, based on the principles of open access and sharing, has its origins in the Human Genome Project, which commenced in 1990 and was completed in 2003. This project marked the beginning of a new way of doing genomics research, as it relied on the collaboration of many scientists, institutions, and funders from around the world (Collins et al. 2003). It marked a transition from a “cottage industry” approach based on bespoke laboratories to high-throughput sequencing involving teams of multidisciplinary experts (Watson 1990). The possibility of the human genome being patented by a private company, Celera, helped to confirm and develop the principle that such knowledge should be freely available to all (Nature 2001; Marris 2005). Using the Human Genome Project approach, a number of data-generating projects have been initiated through joint efforts by national funders, including the Encyclopedia of DNA Elements (ENCODE; http://www.genome.gov/10005107), the Human Epigenome Project (http://www.epigenome.org), the International HapMap Project (http://www.hapmap.org), and, more recently, the 1000 Genomes Project (http://www.1000genomes.org). These have provided unrestricted access to sequence reference libraries via the Internet. Such resources allow new types of scientific questions to be asked, as “vast numbers of polymorphisms can be studied simultaneously, rather than focusing attention on a small number of genes,” and “very many more individuals can be genotyped in a single study” (Day 2009). Such data sets have been presented as the “drivers of progress in biomedical research,” and therefore open access policies have stated that they should be “made immediately available for free and unrestricted use by the scientific community to engage in the full range of opportunities for creative science” (Marris 2005). The role of such projects in advancing science has been seen as testimony to the success of open access policies.
However, there have been concerns expressed about the privacy risks that placing individuals’ sequence data on the Web may have for research participants (Wood 2009 at §21).
In addition to sequence reference libraries, repositories have been established to centrally organize the storage and sharing of data derived from genome-wide association studies (GWAS) (GAIN 2007). These studies compare the genomes of healthy controls with those of people who exhibit a disease or a specific trait in order to identify the genetic variants associated with that disease or trait (Kaye et al. 2009). To obtain the sample sizes needed to do this, researchers have developed new models of collaboration and data sharing. Examples of these projects are the Wellcome Trust Case Control Consortium (Wellcome Trust 2007) in the United Kingdom, the European Genome-Phenome Archive (http://www.ebi.ac.uk/ega), and the National Institutes of Health’s Database of Genotypes and Phenotypes (dbGaP; http://www.ncbi.nlm.nih.gov/gap) in the United States. The aim of these platforms is to maximize the public benefits that can be realized from data sharing (Gibbs 2005), and new methodologies and approaches have had to be developed to handle the vast amounts of data created (Pop and Salzberg 2008). Data must also be deposited within a specific period of time and must meet certain standards of quality. This requirement also includes statements about the nature of the study, which are intended to standardize models for performing studies and reporting results (Little et al. 2009). Such pooling of data ensures that the validity of results can be confirmed in replication studies before they are relied upon, providing a further reason for sharing data (Ioannidis et al. 2008). Unlike access to sequence reference libraries, access to these data sets is provided through a managed access system that requires researchers to establish their credentials and then be approved by a data access committee. There is concern that this managed access model is not as effective for sharing data as Web-based sequence reference data sets, which receive many more hits.
5 Infrastructure Development
In the field of biobanking there has been considerable investment in new population biobanks and cohort studies. One of the research rationales behind the establishment of these resources is to develop information that can help elucidate the fine associations between the genotypes and phenotypes that influence the etiology of common diseases. The need for diverse, well-characterized, large sample groups, both for investigative purposes and for use as controls (Burton et al. 2009), has led to an increased emphasis on cooperation at both the national and international levels (Hattersley and McCarthy 2005). A number of groups have been funded to develop the tools to standardize and harmonize collection and management procedures in order to facilitate wide-scale data and sample sharing, including the Public Population Project in Genomics (P3G; http://www.p3gconsortium.org) and the International Society for Biological and Environmental Repositories (ISBER; http://www.isber.org). Over the past two years there has also been investment in infrastructure to facilitate the linkage and greater use of existing clinical collections of samples, including the Biobanking and Biomolecular Resources Research Infrastructure (BBMRI) (Viertler and Zatloukal 2008) and Biobank Standardisation and Harmonisation for Research Excellence (BioSHaRE; http://www.p3g.org/bioshare) projects in Europe and the Electronic Medical Records and Genomics (eMERGE) Network in the United States (McCarty et al. 2011). The aim of this investment is to provide resources that networks of interdisciplinary teams and consortia located around the globe can use to answer a number of research questions. Large international consortia within Europe and the United States have been funded, as have international collaborations on a grand scale such as the International Cancer Genome Consortium (http://www.icgc.org).
This new emphasis on the linkage of existing biobanks through a common infrastructure requires macro-level, international governance structures and processes to allow the secondary research use of existing information and samples. This raises significant questions about the oversight of global research activity and the best ways to safeguard researcher access to information while protecting the privacy of individuals. Many of the secondary research purposes such infrastructure will make possible were not anticipated at the time when consent was obtained for the collection of the data or samples. The arguments for using existing research collections for secondary research purposes are twofold: First, recruitment to large studies is expensive and time-consuming, and second, larger sample sizes are likely to accelerate research results. Reusing, integrating, and comparing collections will result in an efficient and effective use of funding. However, appropriate governance systems and procedures to link and network new and existing collections at this macro level are still being developed.
6 Open Access Policies
The arguments for the efficient use of resources funded by the public purse also underpin many of the recent open access policies that have been developed by the leading funders of genomics research within the United States, the United Kingdom, and Canada. These policies started with the sequence data [the Bermuda Principles in 1996 (HUGO 1996) and the Fort Lauderdale Agreement in 2003 (Wellcome Trust 2003)] and have now been applied to other forms of data [the Toronto International Data Release Workshop recommendations in 2009 (Birney et al. 2009), the Amsterdam Principles in 2008 (Rodriguez et al. 2009), and the Wellcome Trust joint statement by funders of health research (Wellcome Trust 2010)]. In addition, there are a number of policies for data sharing by the Organisation for Economic Cooperation and Development (OECD) (OECD 2007). All of these policies have statements requiring the protection of individual privacy and in some cases the dignity of communities while at the same time encouraging wide-scale data sharing for public benefit.
Although these policies are still in their infancy, we are starting to see their impact on the planning, execution, and oversight of genomics research and on the way results are disseminated. The question now is how to share data rather than whether data should be shared at all (Kaye et al. 2009). These policies have created a climate in which data sharing is becoming more the norm—not just for large sequencing projects but for many different types of studies. However, there is still evidence to suggest that researchers are reluctant to share data (Piwowar 2011). Open access principles have also come into conflict with privacy concerns. In 2008, aggregate genetic data placed on the Web by researchers for GWAS use had to be withdrawn once it was realized that individual participants could be distinguished from the openly shared data (Homer et al. 2008). These problems of identifiability and disclosure risks are likely to become more frequent as increasingly diverse sources of data are linked (Heeney et al. 2011; P3G et al. 2009).
7 Technological Advances
Advances in information technology and genome sequencing technology have enabled significant changes in the ways that science is carried out and have provided a means to share data on a wide scale. Digital information can be deposited on the Web or in a cloud and then shared with colleagues and other third parties. Once DNA is sequenced from a sample and transformed into a digital form of AGTC base pairs, it can be used for many different purposes and analyzed by different researchers using different methodologies and approaches. The current challenges include issues of data storage, the quality of sequencing data, and the accuracy of genome assembly (Butler 2010) as well as how best to manage and interpret large data sets of sequence information (Mardis 2011). The advances in next-generation sequencing technology have resulted in far richer and more detailed sequence information at a lower cost. Whereas it is estimated that the Human Genome Project cost US$2.7 billion (NHGRI 2010), in 2009 the company Complete Genomics announced that it could sequence an individual genome for US$5,000 (Aldhous 2009). It is anticipated that these costs will continue to fall and that sequencing will no longer be a bespoke activity but will become a routine part of clinical care. As sequencing becomes cheaper, the use of whole-genome sequencing will become the norm in medical research and bring with it a number of new issues.
The challenges that this presents led Mardis and Lunshof (2009) to write that “the established framework of ethical, legal and social issues (ELSI) in genomics has been shaken to its foundations by something as simple as the emergence of personal genomes.” Tabor et al. (2011) note that:
whereas conventional technological approaches might generate data on hundreds of thousands, or even millions of polymorphisms, the overwhelming majority of these variants are located in noncoding regions and likely not of functional significance themselves. In contrast, both exome sequencing (ES) and whole-genome sequencing (WGS) provide information on virtually all functional, protein-coding variants in the genome for each individual participant. This includes most variants known to influence risk of human diseases and traits.
These technologies increase the possibility of identifying serious treatable conditions and generating other incidental findings (Wolf et al. 2008) and have created a heated debate as to whether there is an obligation to report research findings to participants and, if so, how this should be done. This reporting raises a number of ethical issues, such as how to develop management pathways and privacy safeguards, and questions of whether secondary and tertiary researchers also have an obligation to report back findings. New models of reciprocal participation in research that also provide individual-level information have been developed by companies such as 23andMe, where participants are treated as customers rather than “health information altruists” and are given access to genomic information (Kohane and Altman 2005). Further research is needed to establish whether such new models of participation are truly reciprocal, whether they could have wider application, and how management pathways for feedback could be developed (Van Ness 2008).
8 The Effect on Scientific Practice
In combination, these trends have had a marked effect on the scientific agenda and the conduct of genomics research. Research is now carried out by interdisciplinary teams of specialists brought together in flexible research collaborations that can process and analyze large amounts of information and large numbers of samples. The collection of information and samples is still carried out by individual researchers, but the model of large interdisciplinary collaborations means that existing collections can be brought together and reused for new purposes. This is possible only because technological advances make it easy to share and distribute data through global networks. Open access policies are changing the way that data are generated and distributed and are enabling new ways to mine data. Increasingly, there is now a distinction between data generators and data users. Data sets are no longer the sole creation or in the control of one individual or institution, but must be made available to the whole research community. For GWAS, this has been achieved through a new managed access model with formal application processes and access determined by data access committees in consultation with collectors, rather than decided by the principal investigator alone.
These changes in the way that science is conducted mean that the “secondary users of the data are far removed from the researchers who carried out the collection of the samples and data, as well as from the research participants” (Kaye et al. 2009). Data sharing has the potential to sever the ties between the researcher responsible for participant enrollment and the individual participants in an original study. The onward sharing of data raises questions about who is accountable not only to research ethics committees approving new research but also to the research participants for the secondary uses of data in other studies. These advances also challenge our legal and ethical frameworks as data-sharing practices give a new twist to the old questions of informed consent, protection of privacy, and governance of medical research. These trends have had a significant effect on the principles that underpin research and the basis of research participation.
9 Protections for Research Participants
The main purpose of the current research governance system is to protect participants’ interests and ensure that research is carried out ethically. It does not have a mandate to consider the broader ethical issues associated with data sharing, such as equitable access to biorepositories for researchers. A number of procedures, practices, and oversight bodies have been established that are designed to protect research participants. Common to all jurisdictions are the requirements that consent must be obtained before the research commences (although there are a number of exceptions to this basic principle), that an individual has a right of withdrawal, and that there must be some review of the research proposal by an appropriate committee, such as an institutional review board (in the United States) or a research ethics committee (within Europe and elsewhere). These protections derive from the Nuremberg Trial principles (NMT 1949, pp. 181–82), which were intended to protect individual research participants from physical harm rather than informational harm. They were not designed for use in global networks where information and samples flow through international research collaborations; rather, they were developed for a time when research was oriented toward one principal investigator, leading one research project, based within one country, located at one point in time—the “one researcher, one project, one jurisdiction” model (Kaye 2011). As a result, they are focused at the beginning of the research process, and oversight is largely reliant on expert committees.
The nature of whole-genome sequence data and the potential for global data sharing also brings into question the social contract that underpins research participation and the governance mechanisms that have been built around it. The basis for medical research participation has traditionally been an appeal to altruism (Hallowell et al. 2010), solidarity (Knoppers and Joly 2007), and/or the gift model (Busby 2004; Tutton 2002), depending upon the nature of the study (Kohane and Altman 2005). The degree of participant involvement in the research process has varied depending upon the type of study—for example, whether it is clinically based with direct patient contact or epidemiological and concerned with population trends. In some cases, participants have had a passive role as providers of samples, information, and interesting case examples of disease. In other types of research, such as research on HIV/AIDS, participants have been more actively involved in defining the research agenda (Kahn et al. 1998). In all cases, good practice has required that in return for being altruistic, participants’ personal, identifiable information should remain confidential and, if possible, be rendered anonymous. This has also been the basis for not having to obtain explicit consent for new research in cases where this may be difficult and when the risks to individuals are perceived to be low. Research procedures and practices have been established on the basis of this implicit social contract. The traditional workhorses of medical research governance—informed consent, withdrawal, anonymization, and oversight mechanisms—are tested by the new developments in genomics research practice caused by wide-scale data and sample sharing.
10 Informed Consent
Informed consent has been used to respect individuals and to enable research participants to exercise their autonomy in medical research and make decisions about privacy risks. The requirements for informed consent have been enshrined in a number of ethical documents, one example being the Declaration of Helsinki (WMA 2008). With wide-scale data sharing, it is impossible to fulfill the conditions of traditional informed consent as outlined in many ethical and legal documents (Boddington et al. 2011). Participants cannot be informed of all future uses of their information and samples over many years at the time of collection, nor can they be given an assessment of all the potential privacy risks of participation in the research (Beskow et al. 2001). Broad consent has become a practical solution to this problem for biobanks, but this is still contentious within the bioethics literature (Caulfield and Kaye 2009) as research participants are giving a broad consent at the beginning of the research process for the use of their sequence and data for many years. There is some doubt as to whether this enables individuals to fully exercise their autonomy, as they cannot choose whether to be involved in specific research projects using different biorepositories, determine what kind of research they participate in, or properly assess the privacy risks of involvement (Boddington et al. 2011). The focus on individuals and informed consent can also eclipse legitimate family and group privacy concerns, which may differ from those of individuals.
At the present time, consent forms are the only means by which the wishes of research participants can be obtained and recorded. This occurs at the beginning of the research process, when potential participants are presented with an agreement that they cannot negotiate (but can refuse to sign) and that, in the case of biobanks, has to hold for a considerable amount of time. Another limitation of the one-off informed consent form is that researchers must anticipate all eventualities to make the consent future-proof and avoid costly and time-consuming recontact processes. This means that if data-sharing plans are described—which quite often is not the case—it is usually done in very broad terms (Beskow et al. 2001). This raises questions as to how informed participants actually are (Pearce and Smith 2011) and whether they are really in a position to assess the privacy risks of research involvement. Currently, efficient and cost-effective mechanisms by which to go back to individuals for new consent for secondary research are not commonplace. Effectively, broad consent is “consent for governance” by others, as judgments about appropriate uses of data and samples often fall to researchers, advisory boards, or research ethics committees, who must make decisions on behalf of research participants (Kaye 2009). In response to these shortcomings, other models have been proposed such as tiered consent (Haga and Beskow 2008; Wolf and Lo 2004), authorization instead of informed consent (Arnason 2004), and “open consent” (Lunshof et al. 2008). New forms of governance models, such as patient interfaces that give individuals greater control over their information (J. Kaye, E. Whitely, S. Creese, D. Lund and K. Hughes, manuscript in preparation), are in development, and “adaptive governance” mechanisms that give voice to group concerns rather than just those of individuals have also been proposed to address some of the deficiencies of the individual consent model (O’Doherty et al. 2011; Winickoff and Winickoff 2003).
The other foundational principle of medical research ethics is the right of withdrawal, which is the notion that a research participant can discontinue his or her participation in research at any time (Gertz 2008). This also applies to the data and samples that an individual may have given consent to use in research, and is one of the key ways by which an individual can enact decisional privacy. However, in the case of international data sharing this is extremely difficult, if not impossible, to achieve when data and samples are shared widely. Computer data sets containing personal information must be continually archived, and it is difficult to claw back minute segments of sequence spread over a global network when they are used in multiple research projects (Zika et al. 2008