Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Date post: 18-Jan-2018
Upload: melinda-morgan
Transcript
Page 1: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Gene mapping by association

3/4/04 Biomath/HG 207B/Biostat 237

Page 2: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Linkage has its limits

To determine that a trait is closer to marker 1 than marker 2, we need to see recombination between marker 2 and the trait locus.

As the distance between the markers decreases, the number of informative meioses needed to see recombination increases.

At some point linkage analysis becomes impractical because too many families are needed.
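To make the scaling concrete, here is a rough sketch. It assumes independent, fully informative meioses, so that the chance of seeing at least one recombinant among n meioses is 1 - (1 - θ)^n. The function name `meioses_needed` and the 95% target are illustrative choices, not part of the lecture.

```python
import math

def meioses_needed(theta, prob=0.95):
    """Smallest n with P(at least one recombinant in n meioses) >= prob.

    Assumes independent, fully informative meioses with recombination
    fraction theta (an illustrative simplification).
    """
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - theta))

# The required number grows roughly like 3 / theta as theta shrinks.
for theta in (0.10, 0.01, 0.001):
    print(theta, meioses_needed(theta))
```

Under these assumptions a tenfold drop in the recombination fraction costs roughly ten times as many meioses, which is the practical limit the slide refers to.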

Page 3: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Association Studies

• Association is a statistical term that describes the co-occurrence of alleles or phenotypes.

• An allele A is associated with disease D if people with D have a different frequency of A than people without D.

Page 4: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Possible causes for allelic association

• best: allele increases disease susceptibility – candidate gene studies

• good: some subjects share common ancestor – linkage disequilibrium studies

[Figure: loci and alleles; labels D, A1, M, K]

Under linkage equilibrium, P(A,D) = P(A)*P(D). Violation of the equality is termed linkage disequilibrium.

Page 5: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Linkage Disequilibrium

[Figure: founder chromosome pairs at the disease locus (D/d) and the marker locus (A/a); a single founder carries the Da haplotype]

Suppose one of the population founders carries an allelic variant that increases risk of a disease. The disease gene is very close to a marker, so the recombination fraction θ is very small.

Note that D is associated with a. P(a|D) is close to one.

Over many generations (n), there is occasionally recombination between the two genes, so that the population looks like:

[Figure: present-day chromosome pairs; most D chromosomes still carry a, but a recombinant DA haplotype has appeared]

The degree of association between D and a, P(a|D), has decreased, but still P(a|D) > P(a), that is, P(a,D) > P(a)P(D).

Ancestral haplotypes are dA, da, and Da.

Page 6: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

The Degree of Association Between Two Genes Depends on the Distance Between them and the Age of the Population

1. Let $\delta_{aD} = P(aD) - P(a)P(D)$, and similarly for other alleles. Then

$\delta_{aD}(n) = \delta_{aD}(0)\,(1-\theta)^n$

2. At linkage equilibrium

P(a/a|D/d) = P(a/a|d/d) = P(a/a|D/D) = P(a/a)
P(A/a|D/d) = P(A/a|d/d) = P(A/a|D/D) = P(A/a)
P(A/A|D/d) = P(A/A|d/d) = P(A/A|D/D) = P(A/A)

Violation of these equalities is evidence of linkage disequilibrium.
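A small sketch of the decay of disequilibrium, assuming the usual geometric form in which the disequilibrium coefficient shrinks by a factor (1 - θ) each generation. The starting value of 0.05 is a made-up number for illustration only.

```python
def ld_after(delta0, theta, n):
    """Disequilibrium coefficient after n generations: delta0 * (1 - theta)**n."""
    return delta0 * (1.0 - theta) ** n

delta0 = 0.05  # hypothetical initial disequilibrium
for theta in (0.001, 0.01, 0.1):
    decay = [round(ld_after(delta0, theta, n), 5) for n in (10, 100, 1000)]
    print(theta, decay)
```

Tightly linked loci (small θ) keep most of their association even after hundreds of generations, while loosely linked loci lose it within a few dozen.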

Page 7: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Allelic association studies test whether alleles are associated with the trait

• 2 types of association tests
– population-based association test
   • cases and controls are unrelated
   • cross-classify by genotype
   • use χ² test or logistic regression
– family-based association tests
   • cases and controls are related: parents, sibs etc.
   • often based on allele transmission rates
   • prime example: TDT

Page 8: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Mapping Genes using a Case Control Design

1. Example: Non-insulin-dependent diabetes in Pima Indians is associated with the human immunoglobulin gene, Gm (Knowler et al., 1988).

Genotype      Cases          Controls       Total
1/1 or 1/2    23 (.0169)     270 (.0760)    293
2/2           1343 (.983)    3284 (.924)    4627
Total         1366           3554           4920

χ² = 61.6, p < 0.00005.

2. What can go wrong? Association could be due to ethnic differences among cases and controls: population stratification.
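The reported statistic can be checked with a short sketch. It assumes the Pearson chi-square without continuity correction on the 2x2 genotype-by-status table; the function names are illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Rows: genotype (1/1 or 1/2, then 2/2); columns: cases, controls.
chi2 = chi_square_2x2(23, 270, 1343, 3284)
print(round(chi2, 1), p_value_df1(chi2))
```

The statistic comes out at about 61.6, in line with the value quoted from the slide.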

Page 9: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

A Dramatic Example of when Association is due to Population Stratification

1. The Gm genotype differs by degree of Caucasian heritage

Genotype      >50%          <50%           Total
1/1 or 1/2    184 (.441)    109 (.0242)    293
2/2           233           4394           4627
Total         417           4503           4920

χ² = 1185.5, p < 0.00005

2. Diabetes prevalence differs by Caucasian heritage

Diabetes      >50%          <50%           Total
Yes           20 (.0146)    1346 (.112)    1366
No            397           3157           3554
Total         417           4503           4920

3. Controlled for age and degree of Caucasian background, diabetes and Gm are not significantly associated (Knowler et al., 1988).

Page 10: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

How concerned should we be about population stratification invalidating case/control results?

1. The allele frequencies and disease prevalence rarely differ as dramatically by race as in the example.

2. Good epidemiological methods can reduce the problem. Collect information on racial/ethnic background.

3. Sometimes there is no alternative to a case/control design. Family controls may not be available.

On the other hand,

1. Better safe than sorry - family-based control designs

2. Family-based designs require more genotyping but not more phenotyping than case/control

Page 11: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

The Transmission Disequilibrium Test eliminates concern over false positives due to population stratification

A simple illustration of the TDT:

Collect parent-child trios.

[Figure: parents with marker genotypes A/a and A/A]

If the child is chosen without regard to disease status, then the child’s genotype is equally likely to be A/a or A/A.

Spielman et al., 1993
Terwilliger and Ott, 1992

Page 12: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

[Figure: parents with genotypes D/d A/a and d/d A/A; affected child D/d A/a]

However, if the child is chosen because they are affected and the marker allele a is associated with the disease allele D, then the child is more likely to have the A/a genotype at the marker than the A/A genotype.

Page 13: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Testing for Transmission Distortion (Disequilibrium)

A biallelic locus

1. Select individuals with the disease; genotype these individuals as well as their parents.
2. Determine how many heterozygous parents transmit A and how many transmit a.
3. Under the null hypothesis, the probability that a parent with the A/a genotype transmitted an A is ½.
4. Also under the null hypothesis, the maternal and paternal transmissions are independent.
5. In the case where there are only two alleles at the marker, the test is equivalent to a McNemar test.

Transmitted / Not transmitted    A      a
A                                C11    C12
a                                C21    C22

Test statistic: $T = (C_{21} - C_{12})^2 / (C_{21} + C_{12})$

For large samples and under the null hypothesis, T has a chi-square distribution (df=1).
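The biallelic test can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the counts 30 and 50 are hypothetical, and the exact p-value is computed as a two-sided Binomial(n, 1/2) tail, which is one common convention.

```python
import math

def tdt(c12, c21):
    """Biallelic TDT from the two off-diagonal counts.

    Returns (statistic, asymptotic p-value, exact two-sided binomial p-value).
    """
    n = c12 + c21
    stat = (c21 - c12) ** 2 / n
    p_asym = math.erfc(math.sqrt(stat / 2.0))  # chi-square, 1 df
    k = min(c12, c21)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2.0 * tail)
    return stat, p_asym, p_exact

# Hypothetical counts of transmissions from heterozygous parents.
print(tdt(30, 50))
```

Only the off-diagonal cells enter the statistic, so transmissions from homozygous parents contribute nothing.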

Page 14: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

What are we testing with the TDT?

$H_0: P(A \mid A/a) = P(a \mid A/a) = 1/2$

$H_{Alt}: P(A \mid A/a) \neq 1/2$

For a single affected child per family, the null and alternative hypotheses are equivalent to:

$H_0: \theta = 1/2$ or $\delta = 0$

$H_{Alt}: \theta < 1/2$ and $\delta \neq 0$

When more than one affected child per family is used, the TDT confounds linkage and association. Thus little is gained by running the TDT on a data set consisting of several very large pedigrees if linkage of the trait and marker has already been established. With many small unrelated pedigrees, information on association can still be gained.

A strongly positive result suggests that the marker tested is a trait susceptibility locus or that the marker is closely linked to a trait susceptibility locus.

Page 15: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

The TDT has been extended to multiple alleles per locus

                    Allele transmitted
not-transmitted     1         2         . . .    k-1        k          total
1                   -----     C1,2      . . .    C1,k-1     C1,k       n1
2                   C2,1      -----     . . .    C2,k-1     C2,k       n2
. . .
k-1                 Ck-1,1    Ck-1,2    . . .    -----      Ck-1,k     nk-1
k                   Ck,1      Ck,2      . . .    Ck,k-1     -----      nk
total               t1        t2        . . .    tk-1       tk

Ho = transmission to affected child is not dependent on allele type
Ha = transmission to affected child depends on allele type

ti represents the column sum omitting the diagonal term, ni the row sum also omitting the diagonal. Test statistics include

$T_{mh} = \frac{k-1}{k} \sum_{i=1}^{k} \frac{(t_i - n_i)^2}{t_i + n_i}$

Mendel’s TDT1 is proportional to this statistic.

$TDT_2 = \max_{1 \le i \le k} \frac{(t_i - n_i)^2}{t_i + n_i}$

Page 16: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Numerical example: data from a locus with 5 alleles. 120 transmissions from heterozygous parents to affected children.

                Allele transmitted
not-trans.      1      2      3      4      5      n
1               ---    6      4      4      5      19
2               6      ---    5      7      4      22
3               8      7      ---    7      5      27
4               8      5      5      ---    6      24
5               7      8      7      6      ---    28
t               29     26     21     24     20     120

Under some conditions, Tmh is asymptotically distributed as chi-square with k-1 degrees of freedom.

Tmh = ?
TDT2 = ?

Is there evidence of transmission distortion?
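One way to work the exercise is sketched below. It assumes the marginal-homogeneity form Tmh = ((k-1)/k) Σ (t_i - n_i)² / (t_i + n_i) and takes TDT2 as the largest single-allele term. The helper `chi2_sf_even_df` is an illustrative closed form that only handles even degrees of freedom.

```python
import math

# Rows: allele not transmitted; columns: allele transmitted (diagonal unused).
table = [
    [0, 6, 4, 4, 5],
    [6, 0, 5, 7, 4],
    [8, 7, 0, 7, 5],
    [8, 5, 5, 0, 6],
    [7, 8, 7, 6, 0],
]

def multiallelic_tdt(c):
    """Return (Tmh, TDT2) for a square transmission table."""
    k = len(c)
    n = [sum(c[i][j] for j in range(k) if j != i) for i in range(k)]  # row sums
    t = [sum(c[i][j] for i in range(k) if i != j) for j in range(k)]  # column sums
    terms = [(t[i] - n[i]) ** 2 / (t[i] + n[i]) for i in range(k)]
    return (k - 1) / k * sum(terms), max(terms)

def chi2_sf_even_df(x, df):
    """Upper tail of a chi-square variable, valid for even df only."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

t_mh, tdt2 = multiallelic_tdt(table)
print(round(t_mh, 2), round(tdt2, 2), round(chi2_sf_even_df(t_mh, 4), 2))
```

With these assumptions Tmh is 3.6 on 4 degrees of freedom, an asymptotic p-value near 0.46, so this table gives little evidence of transmission distortion. TDT2 is a maximum over alleles and should not be referred directly to a chi-square(1) table.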

Page 17: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

MENDEL determines significance using permutation procedures

Why? If the sample size is small or alleles are rare, the distribution of the TDT statistic is poorly approximated by a chi-square distribution.

How?
(1) For each iteration (usually 10,000 or more)
   (a) Calculate a new TDT table. Hold the parental genotypes fixed. For each child, designate with equal probability that the child gets one of the parental alleles.
   (b) Calculate the TDT statistic and determine if it is larger than the observed TDT statistic.
(2) The p-value is equal to the number of iterations in which the TDT statistic is larger than the observed statistic, divided by the total number of iterations.

What is the reason for the standard error? Permutation p-values are estimated using Monte Carlo simulation with a finite number of iterations.
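The procedure can be imitated with a short Monte Carlo sketch. This is not MENDEL's implementation. It works from the pooled 5-allele table of the numerical example rather than individual families, redraws each heterozygous-parent transmission as a fair coin flip, uses the Tmh statistic, and counts simulated values at least as large as the observed one (a common convention; the slide says "larger"). The seed and iteration count are arbitrary.

```python
import random

# Rows: allele not transmitted; columns: allele transmitted.
table = [
    [0, 6, 4, 4, 5],
    [6, 0, 5, 7, 4],
    [8, 7, 0, 7, 5],
    [8, 5, 5, 0, 6],
    [7, 8, 7, 6, 0],
]

def t_mh(c):
    """Marginal-homogeneity statistic for a square transmission table."""
    k = len(c)
    total = 0.0
    for i in range(k):
        n_i = sum(c[i][j] for j in range(k) if j != i)
        t_i = sum(c[j][i] for j in range(k) if j != i)
        if t_i + n_i > 0:
            total += (t_i - n_i) ** 2 / (t_i + n_i)
    return (k - 1) / k * total

def permutation_p_value(c, iterations=2000, seed=1):
    """Monte Carlo p-value and its standard error under fair transmission."""
    rng = random.Random(seed)
    k = len(c)
    observed = t_mh(c)
    hits = 0
    for _ in range(iterations):
        sim = [[0] * k for _ in range(k)]
        for i in range(k):
            for j in range(i + 1, k):
                m = c[i][j] + c[j][i]  # transmissions from i/j parents
                to_j = sum(rng.random() < 0.5 for _ in range(m))
                sim[i][j] = to_j       # i not transmitted, j transmitted
                sim[j][i] = m - to_j
        if t_mh(sim) >= observed - 1e-12:
            hits += 1
    p = hits / iterations
    se = (p * (1.0 - p) / iterations) ** 0.5
    return p, se

print(permutation_p_value(table))
```

The standard error returned here is the binomial sampling error of the estimated p-value, which shrinks as the number of iterations grows.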

Page 18: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

TDT Summary

• ignores transmissions from homozygous parents

• with two alleles it has an approximate chi-square(1) distribution (McNemar test)
   – but exact p-values can be computed from the Binomial(p=.5) distribution in the bi-allelic case

• If there is one affected per nuclear family, this tests the null: no linkage or no association
   – If the test is significant, there is linkage and association

• If there are multiple affecteds, the TDT will confound linkage and association owing to the dependencies of the trios.
   – users should not expect new insight when the data consist of one or two large disease pedigrees already showing linkage
   – with many small unrelated pedigrees, the chance of confusing linkage with association becomes less of an issue, and the TDT can help in identifying associated marker alleles.

Page 19: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Limitations of the original TDT

(1) Nuclear families
(2) Qualitative traits
(3) Codominant markers

Many methods for extending the TDT have been developed.

We will discuss one in detail, the gamete competition model.

Page 20: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

One way to extend the TDT:

Lange (1988), Jin et al. (1994), and Sham and Curtis (1995) considered a model (Bradley and Terry, 1952) that was originally used to rank teams and predict the outcome of team sports.

How does the model work?

Look at a specific example: Suppose we are interested in predicting the outcome of a playoff game where the Diamondbacks play the Dodgers. Or suppose we want to know the probability that the Dodgers will be the National League West winners this year if we consider regular season results for last year?

Page 21: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Suppose results are:

                     winner
Loser       D’backs   Dodgers   Giants   Rockies   Padres
D’backs     ---       6         4        4         5
Dodgers     6         ---       7        5         4
Giants      8         5         ---      5         6
Rockies     8         7         7        ---       5
Padres      7         8         6        7         ---

Let D’backs/Dodgers → Dodgers denote the event that the D’backs and Dodgers play and the Dodgers win.

In general, for each team i we assign a win parameter $\tau_i$ so that the probability that i beats j is:

$P(i/j \to i) = \frac{\tau_i}{\tau_i + \tau_j}$

Page 22: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Bradley - Terry Model of Competing Sports Teams

$P(i/j \to i) = \frac{\tau_i}{\tau_i + \tau_j}$

Note that multiplying every $\tau_i$ by the same positive constant does not change this probability, so one $\tau_i$ can be fixed at 1. We fix $\tau_{d'backs} = 1$. Note that if $\tau_i > \tau_j$ for all j, then i is the best team.

Let $y_{ij}$ denote the number of times that i plays j and i wins. For example, the D’backs beat the Giants 8 times and the Giants beat the D’backs 4 times ($y_{ij} = 8$ and $y_{ji} = 4$). The win parameters can be determined using the following recurrence relationship

$\tau_i^{m+1} = \frac{\sum_{j \neq i} y_{ij}}{\sum_{j \neq i} (y_{ij} + y_{ji}) / (\tau_i^{m} + \tau_j^{m})}$

where the loglikelihood is

$\ln(L) = \sum_{i,j} y_{ij} \left( \ln \tau_i - \ln(\tau_i + \tau_j) \right)$

Page 23: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

RESULTS

                     winner
Loser       D’backs   Dodgers   Giants   Rockies   Padres
D’backs     ---       6         4        4         5
Dodgers     6         ---       7        5         4
Giants      8         5         ---      5         6
Rockies     8         7         7        ---       5
Padres      7         8         6        7         ---

Ho = all teams are equally likely to win ($\tau_i = 1$ for all i), so that $P(i/j \to i) = 1/2$

LRT = 3.63; the p-value of 0.46 gives no reason to reject the null hypothesis.

Page 24: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

We get more information from this analysis

We get the relative rankings: $\tau_{dodgers} = 1.23$, $\tau_{d'backs} = 1.00$, $\tau_{giants} = 0.87$, $\tau_{rockies} = 0.71$, $\tau_{padres} = 0.67$

With these rankings we can make predictions about the outcomes of games:

$P(d'backs/dodgers \to dodgers) = \frac{1.23}{2.23} = .55$

$P(giants/dodgers \to dodgers) = \frac{1.23}{2.10} = .59$

Note that these probabilities are different from the predictions if we just used the individual match-up records. The estimate is not 8/12 = .67 for dodgers beating giants.
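The fit can be reproduced approximately with a short sketch of the recurrence. It assumes the standard simultaneous update, new value for team i = (wins of i) / Σ_j (games between i and j) / (current i + current j), and uses the win/loss table exactly as transcribed above. The names `tau`, `fit` and `log_likelihood` are illustrative.

```python
import math

teams = ["D'backs", "Dodgers", "Giants", "Rockies", "Padres"]
# losses[r][c]: games in which the row team lost to the column team.
losses = [
    [0, 6, 4, 4, 5],
    [6, 0, 7, 5, 4],
    [8, 5, 0, 5, 6],
    [8, 7, 7, 0, 5],
    [7, 8, 6, 7, 0],
]
k = len(teams)
# y[i][j]: number of times team i beat team j.
y = [[losses[j][i] for j in range(k)] for i in range(k)]

def fit(y, iters=500):
    """Iterate the win-parameter recurrence from equal starting values."""
    size = len(y)
    tau = [1.0] * size
    for _ in range(iters):
        new = []
        for i in range(size):
            wins = sum(y[i][j] for j in range(size) if j != i)
            denom = sum((y[i][j] + y[j][i]) / (tau[i] + tau[j])
                        for j in range(size) if j != i)
            new.append(wins / denom)
        tau = new
    return tau

def log_likelihood(y, tau):
    size = len(y)
    return sum(y[i][j] * (math.log(tau[i]) - math.log(tau[i] + tau[j]))
               for i in range(size) for j in range(size) if i != j)

tau = fit(y)
lrt = 2.0 * (log_likelihood(y, tau) - log_likelihood(y, [1.0] * k))
rel = [t / tau[1] for t in tau]  # rescale so the second team equals 1
print([round(r, 2) for r in rel], round(lrt, 2))
```

Run on the table as transcribed, this gives a likelihood ratio statistic close to the reported 3.63 and relative values close to 1.23, 1.00, 0.87, 0.71 and 0.67. The 1.23 attaches to the first team in the table (29 wins) when the second team (26 wins) is fixed at 1, so the D'backs and Dodgers labels may have been exchanged somewhere between the table and the reported results. Treat that as an observation about the transcript, not a correction of the lecture.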

Page 25: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

How is this sports analysis analogous to TDT?

Think of:

(1) Each possible allele at locus = a team

(2) A heterozygous parent = a match-up

(3) Allele received by child from a heterozygous parent = the winner of the game

(4) The transmission parameters = the win parameters

(5) The win/loss record is determined by the transmissions from heterozygous parents.

Page 26: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

                 transmitted
not trans.    1      2      3      4      5
1             ---    6      4      4      5
2             6      ---    7      5      4
3             8      5      ---    5      6
4             8      7      7      ---    5
5             7      8      6      7      ---

When we ignore disease status, the Bradley-Terry model provides a form of segregation analysis.

When we consider the transmission to affected members only (like this example), we have a form of TDT analysis.

Page 27: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

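The gamete competition likelihood for a pedigree

The general form of the gamete competition likelihood for a pedigree with n individuals is

$L = \sum_{G_1} \cdots \sum_{G_n} \prod_i \mathrm{Pen}(X_i \mid G_i) \prod_j \mathrm{Prior}(G_j) \prod_{\{k,l,m\}} \mathrm{Tran}(G_m \mid G_k, G_l)$

Here person i has marker phenotype $X_i$ and underlying marker genotype $G_i$.

For founders j, the prior is $\mathrm{Prior}(G_j)$.

For offspring, the transmission probability factors:
$\mathrm{Tran}(G_m \mid G_k, G_l) = \mathrm{Tran}(G_{mk} \mid G_k) \cdot \mathrm{Tran}(G_{ml} \mid G_l)$, with
$\mathrm{Tran}(G_{mk} \mid G_k) = \tau_{m_k} / (\tau_{m_k} + \tau_{n_k})$ and $\mathrm{Tran}(G_{ml} \mid G_l) = \tau_{m_l} / (\tau_{m_l} + \tau_{n_l})$

The penetrance, $\mathrm{Pen}(X_i \mid G_i)$, is always 1 or 0, depending on whether $X_i$ and $G_i$ are consistent or inconsistent.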

Page 28: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Assessing significance

We use a likelihood ratio test statistic

LRT = 2*( ln(LHa) - ln(LHo) )

where LHa and LHo are the maximum likelihoods under the alternative and null hypotheses.

Significance?

Approximate p-values can be calculated by assuming that the LRT has a chi-square distribution, or by gene dropping.
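The chi-square route can be sketched with the standard library alone. The helper `chi2_sf` below is an illustrative implementation for integer degrees of freedom, built from the recurrence Q(a+1, z) = Q(a, z) + z^a e^(-z) / Γ(a+1). The inputs are figures quoted elsewhere in these slides: the sports example (LRT 3.63, taken here to have 4 free win parameters) and the three-SNP example (log-likelihoods -704.34 and -663.73, df = 5).

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))  # Q(1/2, z)
    else:
        a, q = 1.0, math.exp(-z)             # Q(1, z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def lrt(loglik_alt, loglik_null):
    """Likelihood ratio statistic 2 * (ln L_alt - ln L_null)."""
    return 2.0 * (loglik_alt - loglik_null)

print(round(chi2_sf(3.63, 4), 2))
snp_lrt = lrt(-663.73, -704.34)
print(round(snp_lrt, 2), chi2_sf(snp_lrt, 5))
```

Asymptotic p-values this small depend heavily on the chi-square approximation holding far out in the tail, which is one reason a simulation-based alternative such as gene dropping is offered.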

Page 29: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Gamete Competition contrasted with the TDT

(1) Gamete competition works on extended pedigrees. No need to break up large families into nuclear families.

(2) If we have only trios, the gamete competition and the TDT are equivalent. Their null hypothesis is no linkage or no association. The alternative hypothesis is linkage and association.

(3) When considering more than one affected per family, the TDT and gamete competition confound association with linkage.

(4) Exact p-values can be determined with the TDT. Gamete competition p-values are asymptotic.

(5) The gamete competition model can be used when there is missing marker information. Allele frequencies can be fixed at population estimates or estimated along with the τ’s.

(6) When there is missing data, the gamete competition is not immune to the effects of population stratification or rare alleles.

Page 30: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Example: Families affected with non-insulin-dependent diabetes and linkage to a marker within the sulfonylurea receptor-1 gene

27 Mexican-American extended pedigrees with 74 affected offspring (all genotyped) at SUR

The likelihood ratio test statistic is 9.133 with 9 degrees of freedom. P-value = 0.043

allele        1      2      3      4      5      6      7      8      9      10
freq          .054   .210   .190   .048   .047   .108   .140   .091   .071   .042
τi            .288   1.00   .810   1.40   .697   .383   .556   .567   .499   .082
se of τi      .215   fixed  .447   .985   .681   .204   .288   .322   .509   .104

Page 31: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Can this model be extended to quantitative traits?

Yes, by recognizing that the Bradley-Terry model is equivalent to a matched case-control design. The transmitted allele is the case, the untransmitted allele is the control.

$\tau_i = e^{\omega_i x_p}$

where $x_p$ denotes child p’s standardized trait value, i denotes allele i, and the probability of an i/j heterozygous parent transmitting i is

$P(i/j \to i) = \frac{e^{(\omega_i - \omega_j) x_p}}{e^{(\omega_i - \omega_j) x_p} + 1}$

Note that one $\omega$ is set to zero.

This is equivalent to conditional logistic regression.

Page 32: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Quantitative Trait Example: ACE

High ACE concentration is associated with a deletion within an intron of the ACE gene.

404 people in 69 families (Sinsheimer et al., 2000).

$P(insertion/deletion \to deletion) = \frac{e^{\omega_{deletion} x_k}}{1 + e^{\omega_{deletion} x_k}}$

$P(insertion/deletion \to insertion) = \frac{1}{1 + e^{\omega_{deletion} x_k}}$

$P(deletion/deletion \to deletion) = P(insertion/insertion \to insertion) = 1.0$

               insertion    deletion
mle            0.00         1.31
s.e. of mle    fixed        0.17

Ho: $\omega_{deletion} = 0$    Ha: $\omega_{deletion} \neq 0$
LRT = 82.76    Asymptotic p-value < 1 x 10^-19
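To see what an estimate of 1.31 means in practice, the sketch below evaluates the transmission probability for a few standardized trait values. It assumes the logistic form with the insertion parameter fixed at zero; the function name and the grid of trait values are illustrative.

```python
import math

def p_transmit_deletion(omega_deletion, x):
    """P(an insertion/deletion parent transmits the deletion) to a child
    with standardized trait value x, insertion parameter fixed at 0."""
    e = math.exp(omega_deletion * x)
    return e / (1.0 + e)

omega = 1.31  # reported estimate for the deletion allele
for x in (-2, -1, 0, 1, 2):
    print(x, round(p_transmit_deletion(omega, x), 3))
```

A child with an average trait value (x = 0) is equally likely to have received either allele, while a child one standard deviation above the mean receives the deletion about 79% of the time under this model.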

Page 33: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Another Example: Analyzing tightly linked SNPs:

SNPs (single nucleotide polymorphisms) tend to be more stable and more abundant than microsatellite markers.

They are predominantly biallelic, so we would like to use several tightly linked markers simultaneously to increase the overall information content.

Recall that we use the allele transmissions from heterozygous parents.

Assuming HWE, the maximum possible proportion of heterozygous parents for a biallelic system is 0.50. For an n-allele system, it is H = (n-1)/n. More alleles, more information.
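The bound follows from the Hardy-Weinberg heterozygosity H = 1 - Σ p_i², which is largest when all n alleles are equally frequent. A minimal sketch, with illustrative function names and made-up frequencies:

```python
def heterozygosity(freqs):
    """Expected proportion of heterozygotes under HWE: 1 - sum(p_i ** 2)."""
    return 1.0 - sum(p * p for p in freqs)

def max_heterozygosity(n):
    """Heterozygosity when all n alleles are equally frequent: (n - 1) / n."""
    return heterozygosity([1.0 / n] * n)

print(max_heterozygosity(2), max_heterozygosity(8))
print(round(heterozygosity([0.9, 0.1]), 2))  # a skewed SNP is far less informative
```

Combining several tightly linked SNPs into haplotypes raises the effective number of alleles and therefore the share of informative parents.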

Page 34: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

The phase of these multilocus SNPs may not be known:

Example: suppose there are three SNPs.

An individual with multilocus genotype 1/2, 1/2, 1/2 could have one of the following haplotype pairs: (1) 111 and 222, (2) 122 and 211, (3) 121 and 212, or (4) 112 and 221.

The gamete competition allows the use of non-codominant markers, so we don’t need to determine which of these haplotype combinations is present in a particular individual.
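The phase ambiguity in the three-SNP example can be enumerated mechanically. This is an illustrative helper, not part of MENDEL: each locus is given as an unordered pair of alleles, and every assignment of alleles to the two chromosomes is collapsed into unordered haplotype pairs.

```python
from itertools import product

def phase_resolutions(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele, allele) tuples, one per locus.
    """
    pairs = set()
    for flips in product((0, 1), repeat=len(genotype)):
        h1 = "".join(str(g[f]) for g, f in zip(genotype, flips))
        h2 = "".join(str(g[1 - f]) for g, f in zip(genotype, flips))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

print(phase_resolutions([(1, 2), (1, 2), (1, 2)]))
```

A genotype that is heterozygous at m loci has 2^(m-1) possible phases, so the ambiguity grows quickly as SNPs are added.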

Page 35: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

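For two linked loci associated with a quantitative trait, the transmission probability is expressed as:

$P(ij/kl \to ij) = \frac{(1-\theta)\, e^{\omega_{ij} x_p}}{(1-\theta)\left(e^{\omega_{ij} x_p} + e^{\omega_{kl} x_p}\right) + \theta\left(e^{\omega_{il} x_p} + e^{\omega_{kj} x_p}\right)}$

If we are using tightly linked SNPs, then $\theta$ is effectively zero and the transmission probability reduces to:

$P(ij/kl \to ij) = \frac{e^{\omega_{ij} x_p}}{e^{\omega_{ij} x_p} + e^{\omega_{kl} x_p}}$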
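A small numerical sketch of the two-locus transmission probability follows. It assumes the general form is (1-θ)e^(ω_ij x) divided by (1-θ)(e^(ω_ij x) + e^(ω_kl x)) + θ(e^(ω_il x) + e^(ω_kj x)), which is a reading of the garbled slide rather than a confirmed formula. All parameter values are hypothetical.

```python
import math

def p_transmit(omega_ij, omega_kl, omega_il, omega_kj, x, theta=0.0):
    """P(an ij/kl parent transmits haplotype ij) to a child with trait value x.

    il and kj are the recombinant haplotypes; theta is the recombination
    fraction between the two loci (assumed form, see the lead-in).
    """
    nonrec = (1.0 - theta) * (math.exp(omega_ij * x) + math.exp(omega_kl * x))
    rec = theta * (math.exp(omega_il * x) + math.exp(omega_kj * x))
    return (1.0 - theta) * math.exp(omega_ij * x) / (nonrec + rec)

# Hypothetical parameters: with theta = 0 the recombinant terms drop out.
print(p_transmit(1.0, 0.0, 0.5, 0.5, x=1.0, theta=0.0))
print(p_transmit(1.0, 0.0, 0.5, 0.5, x=1.0, theta=0.01))
```

With θ = 0 the recombinant haplotype parameters have no effect, which is why tightly linked SNPs can be treated as a single multiallelic marker.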

Page 36: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

An Example

Again we use sex adjusted ACE levels as a quantitative trait.

The three SNPs are labeled by their position and the nucleotides present at the position. A-240T, T1237C, and G2350A. Because the ACE gene spans only 26kb, the recombination fractions between these SNPs are effectively zero.

The pedigree data consist of 83 white British families ranging in size from 4 to 18 members. ACE levels were determined on 405 family members. Genotypes were collected on 555 family members.

Page 37: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

In MENDEL, the most important difference from the previous example will be observed in the locus file.

We need to allow for phase ambiguities (lack of certainty in haplotypes).

    L469    AUTOSOME 6 27     <- # haplotypes, # phenotypes
    ATA     0.40190
    ATG     0.00780
    ACA     0.06740
    ACG     0.18310
    T*A     0.01340           ! T*A corresponds to haplotypes TTA and TCA
    T*G     0.32640           ! T*G corresponds to haplotypes TTG and TCG

We are no longer assuming co-dominant markers, so we must specify the phenotype (of the marker) / genotype relationship. These phenotypes correspond to the marker phenotypes used in the pedigree file.

Page 38: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

RESULTS

Haplotype          ATA      ATG      ACA      ACG      TTA+TCA   TTG+TCG
Ho pijk            .4052    .0079    .0676    .1839    .0133     .3321
s.e. of pijk       .0257    .0045    .0136    .0199    .0059     .0240
Ha pijk            .4019    .0078    .0674    .1831    .0134     .3264
s.e. of pijk       .0256    .0024    .0136    .0198    .0059     .0242
ωijk               .0000    .2440    .2137    1.169    .2765     1.528
s.e. of ωijk       fixed    .9893    .4076    .2352    .5848     .2189

Log-likelihood under Ho = -704.34
Log-likelihood under Ha = -663.73

LRT = 81.22    df = 5    p-value = 4.67 x 10^-16

Page 39: Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.

Many other extensions / alternatives to the TDT have been developed.

These include:

TDT using sibling controls
   Sib-TDT (Spielman and Ewens, 1998)
   DAT (Boehnke and Langefeld, 1998)
   SDT (Horvath and Laird)

TDT for quantitative traits
   Allison (1997), Rabinowitz (1997), Abecasis (2000)

Joint modeling of linkage and association that allows estimation of recombination
   Hästbacka (1992)
   Kaplan, Hill and Weir (1995)
   Terwilliger (1995)

