Logic Regression and Interactions in High Dimensional...

Date post: 29-Mar-2018
Page 1: Logic Regression and Interactions in High Dimensional ...iruczins/presentations/ruczinski.03.03.jhu... · Logic Regression and Interactions in High Dimensional Genomic Data ... STAGE

Logic Regression and Interactions in High Dimensional Genomic Data

Ingo Ruczinski

Department of Biostatistics
Johns Hopkins University

Email: [email protected]

http://biostat.jhsph.edu/~iruczins

With Charles Kooperberg and Michael LeBlanc, FHCRC

Motivation

[With Kathy Helzlsouer and Han-Yao Huang]

The odyssey cohort study consists of 8,394 participants who donated blood samples in 1974 and 1989 in Washington County, Maryland. The cohort has been followed until 2001, and environmental factors such as smoking and dietary intake are available. The goals of the study include finding associations between polymorphisms in candidate genes and disease (including cancer and cardiovascular disease). In particular, gene-environment and gene-gene interactions associated with disease are of interest. Currently, SNP data from 51 genes are available.


Motivation

[With Brian Caffo, Steve Goodman, and Giovanni Parmigiani]

Associations between chromosomal deletions and stages of bladder cancer:

STAGE  2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4

P4     0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 0 1 0
Q5     0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 0
P8     1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 1 0
P9     0 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 0
Q9     1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
...
P11    0 0 0 1 0 1 0 1 0 0 1 1 1 1 0 0 1 0 1 1 0
Q13    0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 1 1
Q14    0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 1 0 0 0 0
P17    0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 1 0
Q18    0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 1 0 1 1 0

Motivation

[With Tony Alberg]

Figure 2. DNA repair genotypes in relation to NMSC: cross-sectional and prospective cohort comparisons

[Figure: study flow chart with a time axis marked 1989 and 2007, restricted to participants with no history of melanoma or noncutaneous malignancies. Cross-sectional comparison of DNA repair genotypes in 1989: no NMSC (N=27,772) versus NMSC (N=709). Prospective cohort comparison of DNA repair genotypes in relation to NMSC risk: no history of NMSC (N=27,279) versus 1st time diagnosis of NMSC prior to other cancer (N=493).]


Table 1. Amino Acid Substitution Variants Identified in DNA Repair and Repair-Related Genes (Source: Mohrenweiser et al 2002, and Goode et al 2002)

Gene Name  Exon        Codon   Common Residue  Variant Residue  Allele Frequency

Base Excision Repair
ADPRT      17          761     Val             Ala              0.18
APE1       5           148     Asp             Glu              0.33
OGG1       7           326     Ser             Cys              0.15-0.45
OGG1       Nucleotide  7143*   A               G                0.15
OGG1       Nucleotide  11657*  A               G                0.15
POLD1      1           19      Arg             His              0.12
POLD1      3           119     Arg             His              0.15
POLD1      4           173     Ser             Asn              0.05
XRCC1      6           194     Arg             Trp              0.13
XRCC1      10          399     Arg             Gln              0.24

Nucleotide Excision Repair
ERCC2      10          312     Asp             Asn              0.4
ERCC2      23          751     Lys             Gln              0.32
ERCC4      8           415     Arg             Gln              0.06
ERCC5      15          1104    Asp             His              0.18
RAD23B     7           249     Ala             Val              0.10
XPC        8           499     Ala             Val              0.24
XPC        15          939     Lys             Gln              0.38

Double Strand Break/Recombination Repair
NBS1       5           185     Gln             Glu              0.34
XRCC2      3           188     Arg             His              0.05
XRCC3      7           241     Thr             Met              0.43
XRCC4      5           247     Ala             Ser              0.08

Damage Recognition, Repair and Cell Cycle Checkpoint
CDKN2A     2           148     Ala             Thr              0.05
RAD52      8           287     Ser             Asn              0.05
MLH1       8           219     Ile             Val              0.12

Mismatch Repair
MSH3       10          514     Glu             Lys              0.05
MSH3       21          940     Arg             Gln              0.1
MSH3       23          1036    Thr             Ala              0.3
MSH6       1           39      Gly             Glu              0.24

* Amino acid substitution variants of these SNPs have not been published. However, these nucleotide substitutions occur in a gene of particular interest (see section B.2.c.), and have been found to associate strongly with risk of prostate cancer. (Goode et al 2002)

Motivation

Lucek and Ott (1997):

“Current methods for analyzing complex traits include analyzing and localizing disease loci one at a time. However, complex traits can be caused by the interaction of many loci, each with varying effect.”

“... patterns of interactions between several loci, for example, disease phenotype caused by locus A and locus B, or A but not B, or A and (B or C), clearly make identification of the involved loci more difficult. While the simultaneous analysis of every single two-way pair of markers can be feasible, it becomes overwhelmingly computationally burdensome to analyze all 3-way, 4-way to N-way 'and' patterns, 'or' patterns, and combinations of loci.”


Logic Regression

$X_1, \ldots, X_k$ are 0/1 (False/True) predictors.

$Y$ is a response variable.

Fit a model $g(E(Y)) = \beta_0 + \sum_{j=1}^{t} \beta_j \times L_j$, where $L_j$ is a Boolean combination of the covariates, e.g. $L_j = (X_1 \wedge X_2) \vee X_4^c$.

Determine the logic terms $L_j$ and estimate the $\beta_j$ simultaneously.

Logic Trees

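An equivalent representation of $(X_1 \wedge X_2^c) \vee (X_3 \wedge (X_1^c \vee X_4))$ is the following:

[Figure: a logic tree whose root "or" joins two "and" nodes. The first "and" has the leaves 1 and 2 (complemented); the second "and" has the leaf 3 and an "or" node with the leaves 1 (complemented) and 4.]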

This is a Logic Tree!
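A minimal sketch of how such a tree could be stored and evaluated. The nested-tuple encoding and the names `leaf`, `evaluate` and `example_tree` are illustrative assumptions, not the representation used in the authors' software.

```python
from typing import Sequence, Tuple, Union

# Assumed encoding: a leaf is ("leaf", index, negated);
# an inner node is (operator, left subtree, right subtree).
Leaf = Tuple[str, int, bool]
Tree = Union[Leaf, Tuple[str, "Tree", "Tree"]]


def leaf(index: int, negated: bool = False) -> Tree:
    """Leaf for predictor X_index (1-based); negated=True means the complement."""
    return ("leaf", index, negated)


def evaluate(tree: Tree, x: Sequence[int]) -> int:
    """Evaluate a logic tree on one observation x = (X_1, ..., X_k) of 0/1 values."""
    if tree[0] == "leaf":
        _, index, negated = tree
        value = bool(x[index - 1])
        return int(not value if negated else value)
    op, left, right = tree
    a, b = evaluate(left, x), evaluate(right, x)
    return int(a and b) if op == "and" else int(a or b)


# (X1 and not X2) or (X3 and (not X1 or X4))
example_tree: Tree = (
    "or",
    ("and", leaf(1), leaf(2, negated=True)),
    ("and", leaf(3), ("or", leaf(1, negated=True), leaf(4))),
)
```

For instance, `evaluate(example_tree, [1, 0, 0, 0])` follows the first branch (X1 true, X2 false) and evaluates to true.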


Comparison to Decision Trees

[Figure: a Decision Tree with 0/1 splits on the nodes C, A, B, D, A, B, shown next to the Logic Tree (A and B) or (C and D).]

A Decision Tree (CART) is something different!

The Move Set

Possible Moves

[Figure: six moves applied to the Initial Tree 1 and (2 or 3); numbers denote predictors.]

(a) Alternate Leaf: 1 and (4 or 3)

(b) Alternate Operator: 1 or (2 or 3)

(c) Grow Branch: 1 and (5 and (2 or 3))

(d) Prune Branch: 2 or 3

(e) Split Leaf: 1 and (2 or (3 and 6))

(f) Delete Leaf: 1 and 2
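The six moves can be sketched as small functions on nested tuples. The encoding (a leaf is a predictor index, an inner node is `(operator, left, right)`), the path convention (1 = left child, 2 = right child) and all function names are assumptions made for illustration; complemented leaves are left out for brevity.

```python
from typing import Tuple, Union

Tree = Union[int, Tuple[str, "Tree", "Tree"]]  # leaf = predictor index
Path = Tuple[int, ...]                          # 1 = left child, 2 = right child


def get_at(tree: Tree, path: Path) -> Tree:
    """Return the subtree reached by following path from the root."""
    for step in path:
        tree = tree[step]
    return tree


def replace_at(tree: Tree, path: Path, new: Tree) -> Tree:
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    op, left, right = tree
    if path[0] == 1:
        return (op, replace_at(left, path[1:], new), right)
    return (op, left, replace_at(right, path[1:], new))


def alternate_leaf(tree, path, new_index):
    return replace_at(tree, path, new_index)


def alternate_operator(tree, path):
    op, left, right = get_at(tree, path)
    return replace_at(tree, path, ("or" if op == "and" else "and", left, right))


def grow_branch(tree, path, op, new_index):
    """Put a new operator above the subtree at path, with a new leaf as its sibling."""
    return replace_at(tree, path, (op, new_index, get_at(tree, path)))


def prune_branch(tree, path, keep):
    """Remove the operator at path and keep only its child number keep (1 or 2)."""
    return replace_at(tree, path, get_at(tree, path)[keep])


def split_leaf(tree, path, op, new_index):
    return replace_at(tree, path, (op, get_at(tree, path), new_index))


def delete_leaf(tree, path):
    """Delete the leaf at path; its sibling moves up to replace the parent."""
    parent_path, side = path[:-1], path[-1]
    return replace_at(tree, parent_path, get_at(tree, parent_path)[3 - side])


initial = ("and", 1, ("or", 2, 3))  # 1 and (2 or 3)
```

With these definitions, `split_leaf(initial, (2, 2), "and", 6)` gives `("and", 1, ("or", 2, ("and", 3, 6)))`, the tree in panel (e).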


Simulated Annealing for Logic Regression

We try to fit the model $g(E(Y)) = \beta_0 + \sum_{j=1}^{t} \beta_j \times L_j$.

• Select a scoring function (RSS, log-likelihood, ...).

• Pick the maximum number of Logic Trees.

• Pick the maximum number of leaves in a tree.

• Initialize the model with $L_j = 0$ for all $j$.

• Carry out the Simulated Annealing Algorithm:

– Propose a move.

– Accept or reject the move, depending on the scores and the temperature.
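The propose/accept loop can be sketched generically. Everything below is an illustrative assumption rather than the authors' implementation: the geometric cooling schedule from 10 to 0.1, the step count, and the toy problem, in which a state is a single expression `X_i op X_j` scored by its number of misclassifications.

```python
import itertools
import math
import random


def accept_probability(old_score, new_score, temperature):
    """Always accept improvements; accept worse moves with probability exp(-increase / t)."""
    if new_score <= old_score:
        return 1.0
    return math.exp((old_score - new_score) / temperature)


def simulated_annealing(initial, score, propose, t_start=10.0, t_end=0.1,
                        n_steps=2000, rng=None):
    """Minimise score by proposing moves and accepting them at a falling temperature."""
    rng = rng or random.Random(0)
    state, current = initial, score(initial)
    best_state, best = state, current
    for step in range(n_steps):
        # Assumed schedule: geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        candidate = propose(state, rng)
        candidate_score = score(candidate)
        if rng.random() < accept_probability(current, candidate_score, temperature):
            state, current = candidate, candidate_score
        if current < best:
            best_state, best = state, current
    return best_state, best


# Toy problem (hypothetical): recover Y = X1 and X3 among 6 binary predictors.
ROWS = list(itertools.product([0, 1], repeat=6))
TARGET = [x[0] & x[2] for x in ROWS]


def toy_score(state):
    """Number of rows on which the expression X_i op X_j disagrees with the target."""
    i, op, j = state
    predict = (lambda x: x[i] & x[j]) if op == "and" else (lambda x: x[i] | x[j])
    return sum(predict(x) != y for x, y in zip(ROWS, TARGET))


def toy_propose(state, rng):
    """Change one leaf or flip the operator, each chosen at random."""
    i, op, j = state
    move = rng.choice(["leaf_i", "leaf_j", "operator"])
    if move == "leaf_i":
        return (rng.randrange(6), op, j)
    if move == "leaf_j":
        return (i, op, rng.randrange(6))
    return (i, "or" if op == "and" else "and", j)


initial_state = (4, "or", 5)
```

A call such as `simulated_annealing(initial_state, toy_score, toy_propose)` returns the best state visited and its score. In logic regression the proposals would come from the move set and the score from the chosen scoring function.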

Bladder Cancer Example

[Figure: deviance (vertical axis, about 170 to 200) against model size (horizontal axis, 0 to 10).]


[Figure: fitted logic trees for the bladder cancer data, ranging from the single leaf Q5 to larger trees that join the predictors P4, P9, P11, Q5, Q9, Q13, Q14 and Q18 with "and" and "or" operators.]

A Global Randomization Test of Association

[Figure: schematic showing the predictors X, the 0/1 response Y, and the permuted response Perm(Y).]
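A global randomization test of this kind can be sketched as follows: permute the response, refit, and compare the best score on the permuted data with the best score on the original data. The function names, the number of permutations and the "+1" correction in the p-value are assumptions. `best_single_predictor_score` is a hypothetical stand-in for the full logic regression search.

```python
import itertools
import random


def global_randomization_test(x_rows, y, fit_score, n_permutations=200, rng=None):
    """Permutation p-value: share of permuted responses that fit at least as well.

    fit_score(x_rows, y) must return the best score found by the model search,
    with lower values meaning a better fit.
    """
    rng = rng or random.Random(0)
    observed = fit_score(x_rows, y)
    permuted = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(permuted)  # breaks any association between X and Y
        if fit_score(x_rows, permuted) <= observed:
            hits += 1
    # The +1 terms count the observed data as one of the permutations.
    return (hits + 1) / (n_permutations + 1)


def best_single_predictor_score(x_rows, y):
    """Stand-in search: misclassification count of the best single predictor."""
    n_predictors = len(x_rows[0])
    return min(
        sum(row[j] != label for row, label in zip(x_rows, y))
        for j in range(n_predictors)
    )


# Toy data: 5 binary predictors in all 32 combinations, response equal to X1.
x_rows = list(itertools.product([0, 1], repeat=5))
y = [row[0] for row in x_rows]
p_value = global_randomization_test(x_rows, y, best_single_predictor_score)
```

A small p-value indicates that the fit on the original data is better than what the search typically finds when the response carries no signal.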


A Sequential Randomization Test for Model Size

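[Figure: schematic showing the predictors X, a fitted tree T, the response Y and the permuted response Perm(Y); the 0/1 responses are permuted separately within the groups T = 1 and T = 0.]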
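The figure labels suggest that the response is permuted only within the groups defined by the fitted tree. Under that reading, a minimal sketch of the permutation step could look like this; the function name and the toy data are assumptions.

```python
import random


def permute_within_groups(y, groups, rng=None):
    """Shuffle the responses y separately inside each group (e.g. T = 0 and T = 1)."""
    rng = rng or random.Random(0)
    permuted = list(y)
    positions_by_group = {}
    for position, group in enumerate(groups):
        positions_by_group.setdefault(group, []).append(position)
    for positions in positions_by_group.values():
        values = [y[p] for p in positions]
        rng.shuffle(values)
        for p, value in zip(positions, values):
            permuted[p] = value
    return permuted


# Toy data: T splits eight cases into two groups of four.
y_example = [1, 0, 1, 1, 0, 0, 1, 0]
t_example = [1, 1, 1, 1, 0, 0, 0, 0]
perm_example = permute_within_groups(y_example, t_example, random.Random(2))
```

Because responses never leave their group, the association between T and Y is preserved, so any further improvement found on the permuted data reflects noise. For two trees, the groups could be the pairs (T1, T2).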

A Sequential Randomization Test for Model Size

[Figure: results of the sequential randomization test for model sizes 0 to 5; horizontal axis from 0.698 to 0.720.]


Sequential Randomization Test for 2 Trees:

[Figure: schematic showing the predictors X, two fitted trees T1 and T2, the response Y and the permuted response Perm(Y); the 0/1 responses are permuted separately within each of the four groups defined by T1 and T2.]

Genetic Analysis Workshop GAW 12

[Figure: diagram of the GAW 12 data, relating the genes G1 to G6, the quantitative traits Q1 to Q5, the environmental factors E1 and E2, and Affection Status.]


Genetic Analysis Workshop GAW 12

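$\text{logit}(\text{affected}) = \beta_0 + \beta_1 \times \text{ENV1} + \beta_2 \times \text{ENV2} + \beta_3 \times \text{GENDER} + \sum_{i=1}^{K} \beta_{i+3} \times L_i$

[Figure: three fitted logic trees, $L_1$, $L_2$ and $L_3$, built from the leaves G2.DS4137, G2.DS13049, G1.RS557, G6.DS5007, G1.RS76 and G2.DS861 with "and" and "or" operators.]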

Multiple Models

[Figure: score (vertical axis, about 1010 to 1070) against temperature (horizontal axis, marked 10, 1, 0.1), with labels 1 to 5.]


Multiple Models

Let $\epsilon_S$ be the score of a certain state $S$.

• We use the acceptance function $a(\epsilon_{old}, \epsilon_{new}, t) = \min\{1, \exp([\epsilon_{old} - \epsilon_{new}]/t)\}$.

• If we keep the temperature constant, this defines a homogeneous Markov chain.

• We constructed the move set to be irreducible and aperiodic, therefore each homogeneous Markov chain has a limiting distribution $\pi_t(S)$.
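The slide does not give the form of the limiting distribution. As a sketch under a simplifying assumption (a symmetric proposal that picks any other state with equal probability, which need not hold for the actual move set), the standard Metropolis result is a limiting distribution proportional to exp(-score/t). The check below builds the transition matrix for three hypothetical states and verifies numerically that this distribution is stationary.

```python
import math


def acceptance(old_score, new_score, temperature):
    """min{1, exp((old - new) / t)}, written to avoid overflow for large improvements."""
    exponent = (old_score - new_score) / temperature
    return 1.0 if exponent >= 0 else math.exp(exponent)


def transition_matrix(scores, temperature):
    """Transition probabilities when every other state is proposed with equal probability."""
    n = len(scores)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = acceptance(scores[i], scores[j], temperature) / (n - 1)
        matrix[i][i] = 1.0 - sum(matrix[i])  # rejected proposals stay put
    return matrix


def boltzmann(scores, temperature):
    """Distribution proportional to exp(-score / t)."""
    weights = [math.exp(-s / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


scores = [1.0, 2.0, 4.0]   # hypothetical scores of three states
temperature = 1.5           # held constant: a homogeneous chain
P = transition_matrix(scores, temperature)
pi = boltzmann(scores, temperature)
pi_next = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
```

Here `pi_next` agrees with `pi` up to rounding, so one step of the chain leaves the distribution unchanged. States with lower scores receive more mass, and more so at lower temperatures.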

Multiple Models

Simulate 10 binary predictors $X_1, \ldots, X_{10}$.

Let $Y = 5 + 1 \times L(X_1, X_2, X_3, X_4) + \epsilon$, with $\epsilon \sim N(0,1)$.

Run a homogeneous Markov chain during “crunch time” for two separate cases:

Case 1: All X are independent.

Case 2: All X are independent, except $X_4$ (in the signal) and $X_5$ (not in the signal), which are heavily correlated.
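A sketch of how the two cases could be simulated, reading the model as Y = 5 + 1 × L(X1, ..., X4) plus N(0,1) noise. Several details are assumptions because the slide does not give them: the form of L (here (X1 and X2) or (X3 and X4)), predictors drawn as fair coin flips, and "heavily correlated" taken to mean that X5 equals X4 with probability 0.9.

```python
import random


def simulate(n, correlated, rng=None, agreement=0.9):
    """Draw n observations of 10 binary predictors and the response Y."""
    rng = rng or random.Random(0)
    rows, ys = [], []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in range(10)]
        if correlated:
            # Case 2 (assumed mechanism): X5 copies X4 with probability `agreement`.
            x[4] = x[3] if rng.random() < agreement else 1 - x[3]
        # Hypothetical signal: the slide only says that L depends on X1, ..., X4.
        logic_term = int((x[0] and x[1]) or (x[2] and x[3]))
        ys.append(5 + 1 * logic_term + rng.gauss(0.0, 1.0))
        rows.append(x)
    return rows, ys


rows_case1, y_case1 = simulate(2000, correlated=False, rng=random.Random(1))
rows_case2, y_case2 = simulate(2000, correlated=True, rng=random.Random(1))
```

In Case 2 a model using X5 in place of X4 scores almost as well as the true one, which is the situation where a single best model can be misleading.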


Multiple Models

[Figure: two bar plots, each showing a value between 0.0 and 1.0 for the predictors X1 to X10.]

Multiple Models

[Figure: bar plot over SNPs, vertical axis from 0.0 to 1.0.]


Statistical Issues: Missing Values

[Figure: missing-value pattern, SNPs (1 to 51) against Patient (100 to 500).]

Statistical Issues: Power

[Figure: four power plots, each with power from 0.0 to 1.0 on the vertical axis. Panel 1: power against difference in proportions (0.00 to 0.10), n1 = 709, n2 = 27772. Panels 2 to 4: power against hazards ratio (0.6 to 2.0), for n = 254, n = 493 and n = 2304. Curves are labelled 0.05, 0.15 and 0.5.]


References

• Kooperberg, C., Ruczinski, I., LeBlanc, M., and Hsu, L. (2001), Sequence Analysis using Logic Regression, Genetic Epidemiology, 21 (S1), 626-631.

• Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2002), Logic Regression - Methods and Software, Proceedings of the MSRI workshop on Nonlinear Estimation and Classification (Eds: D. Denison, C. Holmes, M. Hansen, B. Mallick, B. Yu), Springer.

• Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), Logic Regression, Journal of Computational and Graphical Statistics (to appear). Available at: http://biostat.jhsph.edu/~iruczins/

The Bibliography of this paper contains all the references in this presentation.

