Ecosystem challenges around data use

Leonid Zhukov

Ancestry.com

•  World’s largest online family history resource

•  Started as a publishing company in 1983, online from 1996

•  2.7 million worldwide subscribers

Data at Ancestry

• Historical records – company acquired content collections

• User created content: – Ancestor profiles and family trees – Uploaded photographs and stories

• User behavior data on Ancestry.com

• Customer DNA data

•  10 PB of structured and unstructured data

Historical records

• Historical Content – 14 billion historical records going back to the 17th century – Digitized and searchable

Historical records

•  More than 30,000 content collections

User family trees

•  Family trees: – 60 million family trees – 6 billion profiles

Family trees

Power law distribution of tree sizes

500 nodes, 700 edges

55 generations

time

User contributed content

– 200 million uploaded family photos and stories

Person and record search

•  Search query

Record linkage

•  Record linkage – finding and matching records in multiple data sets with non-unique identifiers (data matching, entity disambiguation, duplicate detection, etc.)

•  Goal: bring together information about the same person

•  Some non-unique identifiers: –  Names: first name, last name (John Smith – 300,000 records) –  Dates: date of birth, date of death –  Places: place of birth, residence, place of death –  Extra: family members, life events

•  Records are often incomplete and contain mistakes

•  Other industries: banking, insurance, government etc
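
A minimal sketch of how two records with non-unique identifiers might be compared. The field names, the weights, and the five-year tolerance on birth year are illustrative assumptions, not the method used at Ancestry. Names and places are compared with a fuzzy string similarity so that small spelling mistakes still count as near matches, and missing fields are skipped rather than penalized.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Record:
    # Hypothetical record layout; real records carry many more fields.
    first_name: str
    last_name: str
    birth_year: Optional[int] = None
    birth_place: Optional[str] = None


def text_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] that tolerates small spelling differences."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(r1: Record, r2: Record) -> float:
    """Weighted average of field similarities over fields present in both records."""
    # (weight, similarity) pairs; the weights are assumptions for illustration.
    scores = [
        (0.30, text_similarity(r1.first_name, r2.first_name)),
        (0.30, text_similarity(r1.last_name, r2.last_name)),
    ]
    if r1.birth_year is not None and r2.birth_year is not None:
        # Full credit for the same year, decaying linearly to zero at 5 years apart.
        gap = abs(r1.birth_year - r2.birth_year)
        scores.append((0.25, max(0.0, 1.0 - gap / 5)))
    if r1.birth_place and r2.birth_place:
        scores.append((0.15, text_similarity(r1.birth_place, r2.birth_place)))
    total_weight = sum(weight for weight, _ in scores)
    return sum(weight * score for weight, score in scores) / total_weight


if __name__ == "__main__":
    census = Record("John", "Smith", 1850, "Boston")
    parish = Record("Jon", "Smith", 1851, "Boston")   # misspelled, one year off
    other = Record("John", "Smith", 1902, "London")   # same name, different person
    print(match_score(census, parish), match_score(census, other))
```

With a name as common as John Smith, the name fields alone cannot separate candidates, so the dates and places decide the ranking. A production system would typically also need blocking or indexing to avoid comparing every pair of records.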

User behavior data

• User behavior data: – 75 million searches daily – 10 million profiles added daily – 3.5 million records attached daily

DNA Data

• Direct to consumer DNA test

•  700,000 SNPs per sample

•  400,000 DNA samples

• No medical studies

Ancestry DNA

• Genetic ethnicity – Reference panel – 26 ethnic regions, 3000 samples

Ancestry DNA

• Genetic inheritance – Identity-by-descent – Cousin matching

Matching DNA

DNA data: privacy and research

Interest in understanding how genetic variations influence heritable diseases and the response to medical treatments is intense. The academic community relies on the availability of public databases for the distribution of the DNA sequences and their variations. However, like other types of medical information, human genomic data are private, intimate, and sensitive. Genomic data have raised special concerns about discrimination, stigmatization, or loss of insurance or employment for individuals and their relatives (1, 2). Public dissemination of these data poses nonintuitive privacy challenges.

Unrelated persons differ in about 0.1% of the 3.2 billion bases in their genomes (3). Now, the most widely used forms of forensic identification rely on only 13 to 15 locations on the genome with variable repeats (4, 5). Single nucleotide polymorphisms (SNPs) contain information that can be used to identify individuals (5, 6). If someone has access to individual genetic data and performs matches to public SNP data, a small set of SNPs could lead to successful matching and identification of the individual. In such a case, the rest of the genotypic, phenotypic, and other information linked to that individual in public records would also become available.

The world population is roughly 10^10. Specifying DNA sequence at only 30 to 80 statistically independent SNP positions will uniquely define a single person (7). Furthermore, if some of those positions have SNPs that are relatively rare, the number that need to be tested is much smaller. If information about kinship exists, a few positions will confirm it. Thus, the transition from private to identifiable is very rapid (see the figure).
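
A back-of-the-envelope sketch of why a few dozen independent SNPs can suffice. It is not the authors' calculation (theirs is in the supporting online material). It assumes biallelic SNPs, genotype frequencies in Hardy-Weinberg proportions, one shared minor allele frequency for every SNP, and a population of ten billion. Under those assumptions, the chance that a random person shares a given genotype shrinks geometrically with each added SNP.

```python
import math


def genotype_match_probability(minor_allele_freq: float) -> float:
    """Chance that two unrelated people share a genotype at one biallelic SNP,
    assuming Hardy-Weinberg genotype frequencies (an illustrative assumption)."""
    p = minor_allele_freq
    q = 1.0 - p
    genotype_freqs = (q * q, 2 * p * q, p * p)
    return sum(f * f for f in genotype_freqs)


def snps_needed(minor_allele_freq: float, population: float = 1e10) -> int:
    """Smallest number of independent SNPs at which the expected number of
    chance matches in the population falls to about one or below."""
    per_snp = genotype_match_probability(minor_allele_freq)
    return math.ceil(math.log(population) / -math.log(per_snp))


if __name__ == "__main__":
    for maf in (0.5, 0.3, 0.1):
        print(f"minor allele frequency {maf}: {snps_needed(maf)} SNPs")
```

In this simplified model the count depends on how informative the chosen SNPs are on average, and the answers are on the same order as the range quoted in the article.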

Tension between the desire to protect privacy and the need to ensure access to scientific data has led to a search for new technologies. However, the hurdles may be greater than had been suspected. For example, one approach to protecting privacy is to limit the amount of high-quality data released and randomly to change a small percentage of SNPs for each subject in the database (8). Suppose that 10% of SNPs are randomly changed in a sequence of DNA, a fairly major obfuscation that would not please many genetics researchers. Our estimates (7) show that measuring as few as 75 statistically independent SNPs would define a small group that contained the real owner of the DNA. Disclosure control methods such as data suppression, data swapping, and adding noise would be unacceptable by similar arguments.
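
A small simulation sketch of the perturbation scenario, under stated assumptions rather than the authors' own estimate: genotypes are drawn independently with a minor allele frequency of 0.5, the released database has 10% of each subject's 75 SNPs changed at random, and an attacker holds one subject's true genotype. All function names are hypothetical. The attacker simply looks for the released record with the fewest mismatches.

```python
import random


def random_genotype(rng: random.Random, n_snps: int, maf: float = 0.5) -> list:
    """Draw genotypes (0, 1, or 2 minor alleles) independently per SNP."""
    weights = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    return rng.choices([0, 1, 2], weights=weights, k=n_snps)


def perturb(rng: random.Random, genotype: list, fraction: float) -> list:
    """Return a copy with the given fraction of positions changed to another value."""
    noisy = list(genotype)
    n_change = round(fraction * len(noisy))
    for i in rng.sample(range(len(noisy)), n_change):
        noisy[i] = rng.choice([g for g in (0, 1, 2) if g != noisy[i]])
    return noisy


def closest_record(database: list, query: list) -> int:
    """Index of the database record with the fewest mismatches to the query."""
    def mismatches(record: list) -> int:
        return sum(a != b for a, b in zip(record, query))
    return min(range(len(database)), key=lambda i: mismatches(database[i]))


if __name__ == "__main__":
    rng = random.Random(0)
    true_genotypes = [random_genotype(rng, 75) for _ in range(2000)]
    released = [perturb(rng, g, 0.10) for g in true_genotypes]
    target = 123  # the person whose true genotype the attacker knows
    print(closest_record(released, true_genotypes[target]) == target)
```

In this setup the perturbed record of the true owner differs from the query at only a handful of positions, while unrelated records differ at most positions, so the noise does little to hide the owner.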

A second approach is to group SNPs into bins. Disregarding exact genomic locations of SNPs increases the number of records that share the same values, thus increasing confidentiality. Our calculations (7) show that such strategies do not protect privacy, because the pattern of binned values is unlikely to match anyone other than the owner of the DNA. Data analysis would be greatly complicated by binning, and the information content would be severely reduced or even eliminated.

Until technological innovations appear, solutions in policy and regulations must be found. We are building the Pharmacogenetics and Pharmacogenomics Knowledge Base (8, 9), which contains individual genotype data and associated phenotype information. No genetic data will be provided unless a user can demonstrate that he or she is associated with a bona fide academic, industrial, or governmental research unit and agrees to our usage policies (including audit of data access) (10). Although this does not prevent data abuse, it provides a way to monitor usage.

Social concerns about privacy are intricately connected to beliefs about benefits of research and trustworthiness of researchers and governmental agencies. In the United States, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the associated Privacy Rules of 2003 (11) generally forbid sharing identifiable data without patient consent. However, they do not specifically address use or disclosure policies for human genetic data. Recent debates in Iceland, Estonia, Britain, and elsewhere (12–15), reveal a range of views on the threats posed by genetic information. The United States may be at one end of this spectrum, as its citizens seem to strongly desire health privacy. Whatever the setting, we recommend explicit clarifications to rules and legislation (such as HIPAA), so that they explicitly protect genetic privacy and set strong penalties for violations. These clarifications should define entities authorized to use and exchange human genetic data and for what purposes.

References and Notes
1. M. R. Anderlik, M. A. Rothstein, Annu. Rev. Genomics Hum. Genet. 2, 401 (2001).
2. P. Sankar, Annu. Rev. Med. 54, 393 (2003).
3. W. H. Li, L. A. Sadler, Genetics 129, 513 (1991).
4. L. Carey, L. Mitnik, Electrophoresis 23, 1386 (2002).
5. H. D. Cash et al., Pac. Symp. Biocomput. 2003, 638 (2003).
6. National Commission on the Future of DNA Evidence, The Future of Forensic DNA Testing: Predictions of the Research and Development Working Group (National Institute of Justice, U.S. Department of Justice, Washington, DC, 2000).
7. See supporting online material for further discussion.
8. L. C. R. J. Willenborg, T. D. Waal, Elements of Statistical Disclosure Control (Springer, New York, 2001).
9. T. E. Klein et al., Pharmacogenomics J. 1, 167 (2001).
10. www.pharmgkb.org/home/policies/index.jsp
11. Fed. Regist. 67, 53181 (2002).
12. R. Chadwick, BMJ 319, 441 (1999).
13. L. Frank, Science 290, 31 (2000).
14. M. A. Austin et al., Genet. Med. 5, 451 (2003).
15. V. Barbour, Lancet 361, 1734 (2003).
16. Supported in part by NIH/NLM Biomedical Informatics Training Grant LM007033 (Z.L.), NSF Grant DMS-0306612 (A.B.O.), and the NIH/NIGMS Pharmacogenetics Research Network and Database U01-GM61374 (R.B.A). We thank J. T. Chang, B. T. Naughton, T. E. Klein, and reviewers.

Supporting Online Material
www.sciencemag.org/cgi/content/full/305/5681/183/DC1

GENETICS

Genomic Research and Human Subject Privacy

Zhen Lin,1 Art B. Owen,2 Russ B. Altman1*

1Department of Genetics, Stanford University School of Medicine, CA 94305–5120, USA. 2Department of Statistics, Stanford University, CA 94035–4065, USA.

*To whom correspondence should be addressed. E-mail: [email protected]

POLICY FORUM

[Figure: Trade-offs between SNPs and privacy. Privacy level (low, medium, high) against the number of independent SNPs (5 to 4000). Labeled regions: insufficient for future genomic research; insufficient for privacy protection; needed to find genetic relationships.]

www.sciencemag.org SCIENCE VOL 305 9 JULY 2004

Z. Lin, A. Owen, R. Altman, Science, vol 305, 2004

Challenges

•  Engineering – Scalability – Availability – Security

• Research – Information retrieval – DNA genomic research

•  Privacy

Date post:	30-Oct-2014
Category:	Science
Upload:	leonid-zhukov