OutlineMain References
IntroductionBackground
The proposed scheme
Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data
Reporter: Ximeng Liu
Supervisor: Rongxing Lu
School of EEE, NTU
http://www.ntu.edu.sg/home/rxlu/seminars.htm
May 1, 2014
http://www.ntu.edu.sg/home/rxlu/seminars.htm Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data
1 Main References
2 Introduction
3 Background
4 The proposed scheme
Main References
1. Ayday E, Raisaro J L, McLaren P J, et al. Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data[J].
2. Damgård I, Geisler M, Krøigaard M. Efficient and secure comparison for on-line auctions[C]//Information Security and Privacy. Springer Berlin Heidelberg, 2007: 416-430.
Introduction
As a result of the rapid evolution in genomic research and the dramatic decrease in the costs of sequencing, the paradigm of classic medicine has been shifting towards a more personalized approach. The use of individual genomic, clinical, and environmental data can be of interest for a large variety of healthcare stakeholders (here described as medical units).
Also, in order to protect the privacy of his sensitive data, an individual might not want to directly provide his genomic data and clinical and environmental attributes to the medical unit. Because of its extremely sensitive nature, genomic data has an unprecedented impact on privacy. In particular, because the genome carries information about a person’s genetic condition and predispositions to specific diseases, the leakage of such information could enable abuse and threats.
Genomic Background
The human genome is encoded in a double-stranded helical DNA molecule, as a sequence of nucleotides. Genome sequencing techniques record the nucleotides by using the letters A, T, G and C, and the whole human genome includes approximately 3 billion letters.
Around 99.9% of the entire genome is identical between any two given individuals. The remaining part (about 0.1%) is responsible for many of our interindividual differences. These variations are called single nucleotide polymorphisms (SNPs) when they are found to be variable in at least 1% of the individuals in a population.
Two different alleles are observed for every SNP. In general, for a SNP that is associated with a disease, one of these alleles carries the risk for the corresponding disease and the other allele does not contribute. For example, assume that the SNP in the above example (with alleles G and T) is associated with a particular disease X.
In this paper, the authors represent (i) a homozygous SNP carrying two non-contributing alleles as 0, (ii) a heterozygous SNP carrying one risk (or protective) allele and one non-contributing allele as 1, and (iii) a homozygous SNP carrying two risk (or protective) alleles as 2. In short, each SNP can be in one of the states from {0, 1, 2}, and we let $\mathrm{SNP}^P_i$ represent the state (content) of $\mathrm{SNP}_i$ (the SNP with ID i) for a patient P.
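The encoding can be sketched in a few lines. This is only an illustration: it assumes a genotype is given as a two-letter string and that the risk (or protective) allele is known; the function name `encode_snp` is hypothetical and does not come from the paper.

```python
def encode_snp(allele_pair, risk_allele):
    """Return the SNP state in {0, 1, 2}: the number of risk (or protective) alleles.

    allele_pair: two-letter string such as "GT" (assumed input format).
    risk_allele: the allele that carries the risk, e.g. "T".
    """
    if len(allele_pair) != 2:
        raise ValueError("a genotype is expected to hold exactly two alleles")
    return sum(1 for allele in allele_pair if allele == risk_allele)


# Example with alleles G and T, assuming T is the risk allele.
for genotype in ("GG", "GT", "TT"):
    print(genotype, encode_snp(genotype, "T"))
```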
Figure: Security model
Computation of the Disease Risk
The strength of the association between each SNP and a disease is usually expressed by the odds ratio (OR), where the odds is the ratio of the probability of occurrence of the disease to that of its non-occurrence in a specific group of individuals. Thus, the OR is the ratio of the odds in the group of individuals carrying a genetic variation (exposed) to the odds in the group of those who do not carry it (unexposed).
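As a small worked example of this definition, the sketch below computes an odds ratio from case and control counts. The counts are made up for illustration, and the function names are hypothetical.

```python
def odds(cases, controls):
    """Odds of the disease in one group: P(disease) / P(no disease).

    With `cases` affected and `controls` unaffected individuals, the group
    size cancels out and the odds reduce to cases / controls.
    """
    return cases / controls


def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds in the exposed group divided by odds in the unexposed group."""
    return odds(exposed_cases, exposed_controls) / odds(unexposed_cases, unexposed_controls)


# Hypothetical counts: 30 of 100 carriers and 10 of 100 non-carriers have the disease.
print(odds_ratio(30, 70, 10, 90))
```

An OR above 1 indicates that carriers have higher odds of the disease than non-carriers.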
When multiple SNPs are associated with a disease, the overall genetic risk (S) of an individual for the corresponding disease can be computed as a weighted average, based on the OR of each associated SNP, by using a logistic regression model.
This model is currently widely used among geneticists and medical doctors for disease risk tests. In such a model, the OR of $\mathrm{SNP}_i$ (i.e., $OR_i$) is generally represented in terms of the regression coefficient $\beta_i$, where $OR_i = \exp(\beta_i)$. Then, assuming $\Pr_g$ is the probability that an individual P will develop a disease X (only considering his genomic data), his overall genetic risk can be computed as below:
\[ S = \ln\left(\frac{\Pr_g}{1 - \Pr_g}\right) = \alpha + \sum_{i \in \varphi_X} \beta_i \, p^i_j(X) \tag{1} \]
where $p^i_j(X)$ is the contribution of $\mathrm{SNP}_i$ to the genetic risk (for disease X) when $\mathrm{SNP}^P_i = j$ ($\mathrm{SNP}^P_i \in \{0, 1, 2\}$), and $\alpha$ is the intercept of the model.
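A plaintext sketch of equation (1) may help fix the notation. All numbers below (intercept, odds ratios, contribution tables, genotype) are made-up assumptions, and the names `genetic_risk`, `betas`, and `contributions` are illustrative rather than taken from the paper.

```python
import math


def genetic_risk(alpha, betas, contributions, genotype):
    """Evaluate S = alpha + sum_i beta_i * p^i_j(X), where j = genotype[i].

    betas:         SNP id -> regression coefficient beta_i = ln(OR_i)
    contributions: SNP id -> tuple (p^i_0, p^i_1, p^i_2)
    genotype:      SNP id -> SNP state in {0, 1, 2}
    """
    return alpha + sum(betas[i] * contributions[i][genotype[i]] for i in betas)


# Hypothetical model with two SNPs associated with disease X.
alpha = -2.0
betas = {"snp1": math.log(1.5), "snp2": math.log(1.2)}       # OR_1 = 1.5, OR_2 = 1.2
contributions = {"snp1": (0.0, 1.0, 2.0), "snp2": (0.0, 1.0, 2.0)}
genotype = {"snp1": 2, "snp2": 1}

S = genetic_risk(alpha, betas, contributions, genotype)
pr_g = 1.0 / (1.0 + math.exp(-S))   # inverse of S = ln(Pr_g / (1 - Pr_g))
print(S, pr_g)
```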
For clinical use, the genetic risk computed in (1) should be categorized based on its risk group. For this purpose, the distribution of the potential genetic scores (in a given population) is generally divided into smaller parts called quantiles.
Example
Figure: Security model
There are 4 different risk groups, each with a different genetic regression coefficient. For example, if S is somewhere between $b_1$ and $b_2$, then we assign the genetic regression coefficient for the corresponding individual as $\beta_2$. For each individual, the genetic score is computed as in (1), and positioned into its risk group. We represent the genetic regression coefficient corresponding to the genetic risk S as $\beta_g$.
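In the clear, positioning a score into its risk group is a simple boundary lookup. The sketch below assumes three boundaries (hence four groups) and made-up coefficients; it also assumes that a score equal to a boundary falls into the upper group.

```python
from bisect import bisect_right


def genetic_coefficient(S, boundaries, group_betas):
    """Return the coefficient of the risk group that contains the score S.

    boundaries:  sorted list [b1, b2, b3] separating the groups
    group_betas: one coefficient per group, len(boundaries) + 1 entries
    """
    return group_betas[bisect_right(boundaries, S)]


# Hypothetical boundaries and coefficients for 4 risk groups.
boundaries = [-1.0, 0.0, 1.0]
group_betas = [0.1, 0.4, 0.9, 1.5]

# A score between b1 and b2 is assigned the second coefficient.
print(genetic_coefficient(-0.5, boundaries, group_betas))
```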
To compute the overall disease risk, the genetic information needs to be combined with the clinical and environmental factors. For this purpose, assuming $\Pr$ is the probability of disease X (this time considering genetic, clinical, and environmental information), a second and final multi-variable logistic regression model is used to find the final (aggregate) regression coefficient $\beta_f$ as below:
\[ \ln\left(\frac{\Pr}{1 - \Pr}\right) = \beta_f = \beta_0 + \beta_g + \sum_{N_i \in N} \bar{\beta}_i N_i \]
Here $\beta_0$ is the new intercept, $N$ is the set of clinical and environmental attributes associated with the disease, and $\bar{\beta}_i$ is the regression coefficient corresponding to the clinical or environmental attribute $N_i$. The probability ($\Pr$) that the corresponding individual will develop disease X is then computed as
\[ \Pr = \frac{e^{\beta_f}}{1 + e^{\beta_f}}. \]
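The final step can be sketched in plaintext as follows. The attribute names and all coefficient values are assumptions chosen for illustration; only the two formulas come from the slides.

```python
import math


def final_coefficient(beta0, beta_g, attr_betas, attrs):
    """beta_f = beta_0 + beta_g + sum_i bar_beta_i * N_i."""
    return beta0 + beta_g + sum(attr_betas[k] * attrs[k] for k in attr_betas)


def disease_probability(beta_f):
    """Pr = e^beta_f / (1 + e^beta_f)."""
    return math.exp(beta_f) / (1.0 + math.exp(beta_f))


# Hypothetical binary attributes: N_i = 1 if the patient has the attribute.
attr_betas = {"smoker": 0.7, "high_cholesterol": 0.3}
attrs = {"smoker": 1, "high_cholesterol": 0}

beta_f = final_coefficient(-1.0, 0.4, attr_betas, attrs)
print(beta_f, disease_probability(beta_f))
```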
Modified Paillier cryptosystem
The public key is represented as $(n, g, h = g^x)$, where the strong secret key is the factorization of $n = zy$ ($z$, $y$ are safe primes), the weak secret key is $x \in [1, n^2/2]$, and $g$ is of order $(z - 1)(y - 1)/2$.
Encryption: To encrypt a message $m \in \mathbb{Z}_n$, we first select a random $r \in [1, n/4]$ and generate the ciphertext pair $(C_1, C_2)$ as below:
\[ C_1 = g^r \bmod n^2 \]
\[ C_2 = h^r (1 + mn) \bmod n^2 \]
For simplicity, in what follows we represent the Paillier encryption of a message $m$ as $[m]$.
Decryption: The message $m$ can be recovered from $[m]$ as follows:
\[ m = \Delta\left(C_2 / C_1^{x}\right), \quad \text{where } \Delta(u) = \frac{(u - 1) \bmod n^2}{n} \]
Proxy re-encryption: Assume we randomly split the secret key into two shares $x_1$ and $x_2$, such that $x = x_1 + x_2$. The modified Paillier cryptosystem enables an encrypted message $(C_1, C_2)$ to be partially decrypted to a ciphertext pair $(C'_1, C'_2)$ using $x_1$ as below:
\[ C'_1 = C_1 \]
\[ C'_2 = C_2 / C_1^{x_1} \bmod n^2 \]
Then, $(C'_1, C'_2)$ can be decrypted using $x_2$ with the aforementioned decryption function to recover the original message.
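The following toy sketch runs encryption, decryption, and the two-step (proxy) decryption end to end. It is illustrative only: the primes are far too small to be secure, the generator is built as $g = -a^{2n} \bmod n^2$ with the arbitrary base $a = 2$ (an assumption, not stated on the slides), and all function names are hypothetical. It needs Python 3.8+ for modular inverses via `pow`.

```python
import secrets

# Toy safe primes (23 = 2*11 + 1, 47 = 2*23 + 1); insecure, for illustration only.
z, y = 23, 47
n = z * y
n2 = n * n
a = 2                                  # assumed base for building g
g = (-pow(a, 2 * n, n2)) % n2          # assumed construction of the generator


def keygen():
    x = secrets.randbelow(n2 // 2) + 1          # weak secret key x in [1, n^2/2]
    return x, pow(g, x, n2)                     # (x, h = g^x)


def encrypt(m, h):
    r = secrets.randbelow(n // 4) + 1           # r in [1, n/4]
    return pow(g, r, n2), (pow(h, r, n2) * (1 + m * n)) % n2


def strip_share(c1, c2, share):
    """Compute c2 / c1^share modulo n^2."""
    return (c2 * pow(pow(c1, share, n2), -1, n2)) % n2


def delta(u):
    return ((u - 1) % n2) // n


def decrypt(c, key):
    c1, c2 = c
    return delta(strip_share(c1, c2, key))


def partial_decrypt(c, x1):
    c1, c2 = c
    return c1, strip_share(c1, c2, x1)          # (C1', C2')


x, h = keygen()
x1 = secrets.randbelow(x)                       # random share in [0, x - 1]
x2 = x - x1                                     # so that x = x1 + x2

c = encrypt(2, h)                               # e.g. an SNP state
print(decrypt(c, x))                            # direct decryption with x
print(decrypt(partial_decrypt(c, x1), x2))      # first x1, then x2
```

Both prints recover the plaintext, since dividing by $C_1^{x_1}$ and then by $C_1^{x_2}$ is the same as dividing by $C_1^{x}$.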
DGK cryptosystem
Key generation picks random elements $g, h \in \mathbb{Z}_n^*$ such that the multiplicative order of $h$ is $v$ modulo $p$ and $q$ (with $n = pq$), and $g$ has order $uv$. The public key is $pk = (n, g, h, u)$ and the secret key is $sk = (p, q, v)$. The ciphertext is $E_{pk}(m, r) = g^m h^r \bmod n$, where $m$ is the message and $r$ is random.
Decryption: note that $E_{pk}(m, r)^v = (g^v)^m \bmod n$. Clearly, $g^v$ has order $u$, so there is a 1-1 correspondence between values of $m$ and values of $(g^v)^m \bmod n$. Since $u$ is very small, one can simply build a table containing values of $(g^v)^m \bmod n$ and corresponding values of $m$.
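A toy instance makes the table-based decryption concrete. The parameters below ($u = 7$, $v = 11$, $p = 463$, $q = 617$) are assumptions picked only so that $uv$ divides both $p - 1$ and $q - 1$; they offer no security, and the helper names are hypothetical. Python 3.8+ is assumed for modular inverses via `pow`.

```python
import secrets

# Toy parameters (illustrative assumptions, far too small for real use).
u, v = 7, 11              # small plaintext-space prime u, secret prime v
p, q = 463, 617           # primes with u*v dividing p - 1 and q - 1
n = p * q


def element_of_order(d, prime, prime_factors_of_d):
    """Find an element of exact multiplicative order d modulo a prime."""
    for a in range(2, prime):
        cand = pow(a, (prime - 1) // d, prime)   # cand^d == 1 by Fermat
        # The order is exactly d if cand^(d/f) != 1 for every prime factor f of d.
        if all(pow(cand, d // f, prime) != 1 for f in prime_factors_of_d):
            return cand
    raise ValueError("no element of the requested order")


def crt(rp, rq):
    """Combine residues modulo p and modulo q into one residue modulo n."""
    return (rp * q * pow(q, -1, p) + rq * p * pow(p, -1, q)) % n


# g has order u*v and h has order v, both modulo p and modulo q.
g = crt(element_of_order(u * v, p, (u, v)), element_of_order(u * v, q, (u, v)))
h = crt(element_of_order(v, p, (v,)), element_of_order(v, q, (v,)))


def encrypt(m):
    r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n) * pow(h, r, n)) % n     # E(m, r) = g^m h^r mod n


# Decryption table: (g^v)^m mod n  ->  m, for every m in [0, u).
table = {pow(g, v * m, n): m for m in range(u)}


def decrypt(c):
    return table[pow(c, v, n)]                   # c^v = (g^v)^m since h^v = 1


def is_zero(c):
    """Cheap test used in comparison protocols: does c encrypt 0?"""
    return pow(c, v, n) == 1


print([decrypt(encrypt(m)) for m in range(u)])
```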
1) A and B compute, for $i = 1, \dots, l$, sharings $[w_i]$ where $w_i = m_i + x_i - 2 x_i m_i = m_i \oplus x_i$.
2) A and B now compute, for $i = 1, \dots, l$, sharings $[c_i]$ where $c_i = x_i - m_i + 1 + \sum_{j=i+1}^{l} w_j$. Note that if $m > x$, then there is exactly one position $i$ where $c_i = 0$; if $m \le x$, there is none.
3) A uses his secret key to decide, as described in the previous section, whether any of the received encryptions contains 0. If this is the case, he outputs $m > x$. Otherwise, he outputs $m \le x$.
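The bit-level logic behind these steps can be checked on plaintext values. The sketch below performs no encryption or secret sharing; it only evaluates $w_i$ and $c_i$ in the clear, with the sum of the higher-order $w_j$ added to $x_i - m_i + 1$, which is the form for which a zero appears exactly when $m > x$. The function names are hypothetical.

```python
def to_bits(value, length):
    """Little-endian bit list: index 0 holds the least significant bit."""
    return [(value >> i) & 1 for i in range(length)]


def greater_than(m, x, length):
    """Return True iff some c_i equals 0, which happens exactly when m > x."""
    mb, xb = to_bits(m, length), to_bits(x, length)
    # w_i = m_i + x_i - 2 * x_i * m_i, i.e. the XOR of the two bits.
    w = [mi + xi - 2 * xi * mi for mi, xi in zip(mb, xb)]
    # c_i = x_i - m_i + 1 + (sum of w_j over the more significant positions j > i).
    c = [xb[i] - mb[i] + 1 + sum(w[i + 1:]) for i in range(length)]
    return 0 in c


print(greater_than(6, 5, 4), greater_than(5, 6, 4), greater_than(5, 5, 4))
```

A zero at position $i$ requires $m_i = 1$, $x_i = 0$, and equal bits at all higher positions, which is the usual definition of $m > x$ on binary strings.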
System Model
Figure: Security model
The encryption of the genomic data of the patient is performed at a certified institution (CI), which is a trusted entity. The medical unit (MU) can be a pharmacist, a pharmaceutical company, a regional health ministry, an online direct-to-consumer service provider, or a physician.
The storage and processing of genomic, clinical, and environmental data is done at a storage and processing unit (SPU) for efficiency and security. We note that a private company (e.g., a cloud storage service), the government, or a non-profit organization could play the role of the SPU.
Initialization
The patient’s secret key $x$ is randomly divided into two shares $x_1$ and $x_2$ (such that $x = x_1 + x_2$): $x_1$ is provided to the SPU and $x_2$ to the MU.
Sequencing and Clinical and Environmental Data Collection
Clinical and environmental data of the patient is collected during his doctor visits or directly provided by the patient. For example, data about his cholesterol level or his blood-sugar level is collected during his doctor visits, whereas data such as his age, weight, height, or family history is provided by the patient.
Encryption and Storage of Genomic, Clinical, and Environmental Data
After the sequencing and the extraction of the SNPs of the patient, the CI encrypts the contents of all SNP positions of the patient (to obtain $[\mathrm{SNP}^P_i]$).
Privacy-Preserving Computation of Disease Risk
1) Computing the genetic risk: As before, let $\mathrm{SNP}_i$ represent the position (or ID) of a SNP, $\mathrm{SNP}^P_i$ represent the content of the corresponding SNP ($\mathrm{SNP}^P_i \in \{0, 1, 2\}$), and $\beta_i$ represent the regression coefficient, that is, the strength of the association between $\mathrm{SNP}_i$ and disease X. Also, let $p^i_j(X)$ be the contribution, depending on the content, of $\mathrm{SNP}_i$ to the genetic risk (for disease X) when $\mathrm{SNP}^P_i = j$. Then, the MU computes the (encrypted) genetic risk ($[S]$) of patient P for disease X using the encrypted SNPs of the patient as below:
Figure: Security model
However, as the above computed genetic risk is encrypted, to find the regression coefficient corresponding to the computed genetic risk, we propose to use a privacy-preserving integer comparison algorithm between the MU and the SPU.
2) Computing the genetic regression coefficient: We let $b^l_i$ and $b^u_i$ represent the lower and upper boundary of the i-th risk group of the genetic risk scale, respectively. In short, the MU compares $[S]$ with the boundaries of the genetic risk scale in a privacy-preserving way, such that neither the MU nor the SPU learns the value of $S$ or the result of any comparison.
Figure: Security model
The MU computes $[z] = [2^L + S - b^j_i]$. Let $z_{L-1}$ represent the most significant bit of $z$. Then, (i) $z_{L-1} = 0$ if $S < b^j_i$; and (ii) $z_{L-1} = 1$ if $S \ge b^j_i$. Thus, the MU needs to compute $[z_{L-1}]$, where $[z_{L-1}] = [2^{-L}(z - (z \bmod 2^L))]$, since $z - (z \bmod 2^L)$ equals $2^L$ times that bit.
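The arithmetic behind this comparison can be verified on plaintext integers. The sketch assumes that the score and the boundary are non-negative integers below $2^L$ (for example, after scaling); the function name is hypothetical, and no encryption is involved.

```python
def compare_with_boundary(S, b, L):
    """Return 1 if S >= b and 0 otherwise, for integers 0 <= S, b < 2**L."""
    z = 2**L + S - b                 # 0 < z < 2**(L + 1)
    top = z - (z % 2**L)             # equals 2**L if S >= b, and 0 otherwise
    return top >> L                  # divide by 2**L to obtain the bit


L = 3
print(compare_with_boundary(5, 3, L), compare_with_boundary(2, 3, L))
```

In the protocol itself the hard part is obtaining $[z \bmod 2^L]$ under encryption, which is what the computation of $[\lambda]$ addresses.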
Computing [λ]
Figure: Security model
Here $\langle c_i \rangle = \langle \hat{d}_i - \hat{r}_i + s + 3 \sum_{j=i+1}^{L-1} w_j \rangle$.
If $a = 1$ and $s = 1$ ($s$ is the number randomly selected by the MU), then $\hat{d} \ge \hat{r}$ (i.e., $\lambda = 0$). Similarly, if $a = 0$ and $s = 1$, then $\hat{r} > \hat{d}$ (i.e., $\lambda = 1$). Thus, if $s = 1$, the MU sets $[\lambda] = [1 - a]$, and if $s = -1$, it sets $[\lambda] = [a]$. Using $[\lambda]$, the MU can compute $[z \bmod 2^L]$.
Let $[G(S, b^u_i)] = [z_{L-1}]$ represent the (encrypted) result of the comparison between $S$ and $b^u_i$. Then, (i) $G(S, b^u_i) = 0$ if $S < b^u_i$; and (ii) $G(S, b^u_i) = 1$ if $S \ge b^u_i$.
Figure: Security model
Computing the final disease risk
\[ [\beta_f] = \left[ \beta_0 + \beta_g + \sum_{i=1}^{m} \bar{\beta}_i N_i \right] \]
$N_i = 1$ if the patient has the corresponding clinical or environmental attribute, and $N_i = 0$ otherwise. As we discussed before, even if $N_i$ is non-binary, it can be transformed to a binary number using the privacy-preserving comparison algorithm.
Finally, the MU computes the final disease risk of the patient for disease X as $\frac{e^{\beta_f}}{1 + e^{\beta_f}}$.
Thank you
Rongxing’s Homepage: http://www.ntu.edu.sg/home/rxlu/index.htm
PPT available @: http://www.ntu.edu.sg/home/rxlu/seminars.htm
Ximeng’s Homepage: http://www.liuximeng.cn/