
SYLLABUS INTRODUCTION TO PLANT QUANTITATIVE GENETICS

Tucson, 8 – 10 Jan. 2018

INSTRUCTORS:

Mike Gore, Cornell University, [email protected]
Lucia Gutierrez, Department of Agronomy, University of Wisconsin, Madison, [email protected]
Bruce Walsh, Department of Ecology & Evolutionary Biology, University of Arizona, [email protected]

REFERENCES:

B = Bernardo: Breeding for Quantitative Traits in Plants, 2nd ed.
LW = Lynch & Walsh: Genetics and Analysis of Quantitative Traits (book)
WL = Walsh & Lynch: Evolution and Selection of Quantitative Traits (website) http://nitro.biosci.arizona.edu/zbook/NewVolume_2/newvol2.html

LECTURE SCHEDULE

Monday, 8 Jan

8:30 - 10:00 am: 1. Introduction to Modern Plant Breeding (Gore, Gutierrez, Walsh)
    Background reading: B Chapter 1
10:00 - 10:30 am: Break
10:30 am - 12:00 pm: 2. Basic Genetics (Walsh, Gore)
    Background reading: LW Chapter 4
12:00 - 1:30 pm: Lunch
1:30 - 3:00 pm: 3. Basic Statistics (Walsh)
    Background reading: LW Chapters 2, 3
    Additional reading: LW Appendix A4
3:00 - 3:30 pm: Break
3:30 - 5:00 pm: 4. Allelic Effects and Genetic Variances (Walsh)
    Background reading: B Chapters 3, 6
    Additional reading: LW Chapters 4, 5

Tuesday, 9 Jan

8:30 - 10:00 am: 5. Resemblance Between Relatives (Walsh)
    Background reading: B Chapter 6
    Additional reading: LW Chapter 7
10:00 - 10:30 am: Break
10:30 am - 12:00 pm: 6. Heritability and Field Designs (Gutierrez)
    Background reading: B Chapters 6, 7
    Additional reading: LW Chapters 17, 18, 20, 22
    Holland, J., W.E. Nyquist, C.T. Cervantes-Martinez. 2010. Estimating and Interpreting Heritability for Plant Breeding: An Update. Plant Breeding Reviews 22: 9-112.
12:00 - 1:30 pm: Lunch
1:30 - 3:00 pm: 7. QTL Mapping (Gutierrez)
    Background reading: B Chapter 5
    Additional reading: LW Chapters 12-15
3:00 - 3:30 pm: Break
3:30 - 5:00 pm: 8. Association Mapping (Gore)
    Background reading: B Chapter 5.4
    Additional reading: LW Chapter 16

Wednesday, 10 Jan

8:30 - 10:00 am: 9. Inbreeding, Heterosis (Gore)
    Background reading: B Chapter 12
    Additional reading: LW Chapter 10
10:00 - 10:30 am: Break
10:30 am - 12:00 pm: 10. Mass and Family Selection (Walsh)
    Background reading: B Chapters 9, 10
    Additional reading: WL Chapters 12, 13, 19, 20, 35

ADDITIONAL BOOKS ON QUANTITATIVE GENETICS

General Falconer, D. S. and T. F. C. Mackay. Introduction to Quantitative Genetics, 4th Edition

Lynch, M. and B. Walsh. 1998. Genetics and Analysis of Quantitative Traits. Sinauer.

Mather, K., and J. L. Jinks. 1982. Biometrical Genetics. (3rd Ed.) Chapman & Hall.

Plant Breeding

Wricke, G., and W. E. Weber. 1986. Quantitative Genetics and Selection in Plant Breeding. De Gruyter.

Mayo, O. 1987. The Theory of Plant Breeding. Oxford.

Stoskopf, N. C.. D. T. Tomes, and B. R. Christie. 1993. Plant breeding: Theory and practice. Westview, Boulder.

Sleper, D. A., and J. M. Poehlman. 2006. Breeding Field Crops. 5th Edition. Blackwell

Bernardo, R. 2010. Breeding for Quantitative Traits in Plants, 2nd Ed Stemma Press.

Hallauer, A. R., M. J. Carena, and J. B. Miranda Filho. 2010. Quantitative Genetics in Maize Breeding. Iowa State Press.

Statistical and Technical Issues Bulmer, M. 1980. The Mathematical Theory of Quantitative Genetics. Clarendon Press.

Kempthorne, O. 1969. An Introduction to Genetic Statistics. Iowa State University Press.

Sorensen, D., and D. Gianola. 2002. Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer.

Saxton, A. M. (Ed). 2004. Genetic Analysis of Complex Traits Using SAS. SAS Press.

Wu, R., C.-X. Ma, and G. Casella. 2007. Statistical Genetics of Quantitative Traits: Linkage, Maps. and QTL. Springer, N.Y

1

Lecture 1: Introduction to Modern Plant Breeding

Bruce Walsh Notes
Introduction to Plant Quantitative Genetics
Tucson, 8-10 Jan 2018

2

Importance of Plant breeding

•  Plant breeding is the most important technology developed by man. It allowed civilization to form and its continual success is critical to maintaining our way of life

•  Problem: Feeding 9 billion (+) people with the same (or fewer) inputs
    –  Same or less acreage
    –  Same or less fertilizer, pesticides, water
    –  Adapting to climate and environmental change

3

Goals of Plant breeding

•  Increase the frequency of favorable alleles within a line
    –  Additive effects

•  Increase the frequency of favorable genotypes within a line
    –  Dominance and interaction effects

•  Better adapt crops to specific environments
    –  Region-specific cultivars (high location G x E)
    –  Stability across years within a region (low year-to-year G x E)

4

Objectives

•  Development of pure (i.e. highly inbred) lines with high per se performance

•  Development of pure lines with high hybrid performance (either with each other or with a testcross)

•  Less emphasis on developing outbred (random-mating) populations with improved performance

•  Development of lines with high regional G x E, low year G x E

5

Animal and tree breeding

•  Similar goals, but since mostly outcrossing, the goal is to create high-performing populations, not inbred lines

•  Generally speaking, inbreeding is bad in animals and many trees

•  Focus on finding those parents with the best transmitting abilities (highest breeding values)

•  Less of a G x E focus with animals, less of a focus on line and hybrid breeding

6

Special features exploited by plant breeders

•  Selfing allows for the capture of specific genotypes, and hence the capture of interactions between alleles and loci (dominance and epistasis)
    –  Homozygous for selfed lines
    –  Heterozygous for crossed lines

•  Often high reproductive output (relative to animal breeding)

•  Seeds allow for multigeneration progeny testing, wherein individuals are chosen on the performance of their progeny, or of their sibs
    –  Allows for better control over G x E by testing over multiple sites/years

7

Historical plant breeding

•  Early origins
    –  Creation of new lines through species crosses (allopolyploids)
    –  Visual selection
    –  Early domestication (selection for specific traits for ease of harvesting)

•  Biometrical school
    –  Using crosses to predict average performance under inbreeding or crossing or response to selection
    –  Better management of G x E

8

Modern tools

•  Molecular markers
    –  Initially low density for QTL mapping, introgression of major genes into elite germplasm
    –  With high-density markers, association mapping and MAS/genomic selection

•  New statistical tools
    –  Mixed model methods
    –  Bayesian approaches to handle high-dimensional data sets
    –  New methods to deal with G x E

•  Other technologies
    –  Better standardization of field sites (laser-tilled fields, GPS, better micro- and macro-environmental measurements)
    –  High throughput phenotypic scoring
    –  DH lines

9

Diversity

•  Plant breeders face the conundrum of using inbred lines to concentrate elite genotypes, but requiring a very large collection of such lines to store variation for further selection

•  Landraces or local cultivars may be highly adapted to specific environments, but otherwise not elite

•  Issue with keeping germplasm elite while introgressing genes/regions of interest.

10

Integrated Approaches

•  How do we best combine the rich history of quantitative genetics and classical plant breeding with the new tools from genomics and other advances?

•  Key: Quantitative genetics has all of the machinery needed to fully incorporate these new sources of information

•  The goal of this course is to show how this is done.

1

Lecture 2 Basic Plant Genetics



2

Overview

•  Ploidy
•  Linkage
•  Linkage disequilibrium (LD)
•  Genetic markers
•  Mapping functions
•  Organelle inheritance
•  Mating systems and types of crosses
•  Gene actions
    –  Dominance and Epistasis
    –  Pleiotropy

3

Ploidy

•  Most animals are diploid (2n), with their gametes (eggs, sperm) containing a haploid set of n chromosomes
•  Polyploids are much more common in plants.
•  Allopolyploids consist of haploid sets from two (or more) species
    –  e.g., an allotetraploid is AABB
    –  An allohexaploid is AABBCC
    –  Generally speaking, allopolyploids largely behave as diploids, i.e., each pollen/egg gets a haploid set from each of the founding species
•  Autopolyploids have multiple haploid sets from the same species
    –  Autotetraploids (4n) and autohexaploids (6n)
    –  These give pollen and eggs with two (2n) and three (3n), respectively, copies of each homologous chromosome

4

5

Linkage

•  Independent assortment for unlinked genes

•  Linkage

•  Computing expected genotypic frequencies from linkage

6

Dealing with two (or more) genes

For his 7 traits, Mendel observed Independent Assortment

The genotype at one locus is independent of the second

RR, Rr - round seeds, rr - wrinkled seeds

Pure round, green (RRgg) x pure wrinkled yellow (rrYY)

F1 --> RrYg = round, yellow

What about the F2?

7

Let R- denote RR and Rr. R- are round. Note in F2, Pr(R-) = 1/2 + 1/4 = 3/4

Likewise, Y- are YY or Yg, and are yellow

Phenotype Genotype Frequency

Yellow, round Y-R- (3/4)*(3/4) = 9/16

Yellow, wrinkled Y-rr (3/4)*(1/4) = 3/16

Green, round ggR- (1/4)*(3/4) = 3/16

Green, wrinkled ggrr (1/4)*(1/4) = 1/16

Or a 9:3:3:1 ratio
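The 9:3:3:1 ratio above can also be checked by brute-force enumeration. The following is a minimal sketch, assuming the F1 parents are both RrYg and that the two loci assort independently; the function name `f2_phenotype_freqs` is made up for this illustration.

```python
from fractions import Fraction
from itertools import product


def f2_phenotype_freqs():
    """Enumerate F2 phenotypes from RrYg x RrYg under independent assortment."""
    quarter = Fraction(1, 4)
    # Each F1 parent produces the four gametes RY, Rg, rY, rg with frequency 1/4.
    gametes = {(shape, color): quarter for shape, color in product("Rr", "Yg")}
    freqs = {}
    for (g1, p1), (g2, p2) in product(gametes.items(), repeat=2):
        # R (round) and Y (yellow) are dominant: one copy is enough.
        shape = "round" if "R" in (g1[0], g2[0]) else "wrinkled"
        color = "yellow" if "Y" in (g1[1], g2[1]) else "green"
        key = (color, shape)
        freqs[key] = freqs.get(key, 0) + p1 * p2
    return freqs


if __name__ == "__main__":
    for phenotype, freq in sorted(f2_phenotype_freqs().items()):
        print(phenotype, freq)
```

Exact fractions are used so the results can be compared directly with 9/16, 3/16, 3/16 and 1/16.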

8

Mendel was wrong: Linkage

Bateson and Punnett looked at

flower color: P (purple) dominant over p (red)
pollen shape: L (long) dominant over l (round)

Phenotype       Genotype   Observed   Expected (9:3:3:1)

Purple long     P-L-       284        215

Purple round    P-ll       21         71

Red long        ppL-       21         71

Red round       ppll       55         24

Excess of PL, pl gametes over Pl, pL

Departure from independent assortment

9

Linkage

If genes are located on different chromosomes they (with very few exceptions) show independent assortment.

Indeed, peas have only 7 chromosomes, so was Mendel lucky in choosing seven traits at random that happen to all be on different chromosomes?

However, genes on the same chromosome, especially if they are close to each other, tend to be passed onto their offspring in the same configuration as on the parental chromosomes.

10

Consider the Bateson-Punnett pea data

Let PL / pl denote that in the parent, one chromosome carries the P and L alleles (at the flower color and pollen shape loci, respectively), while the other chromosome carries the p and l alleles.

Unless there is a recombination event, one of the two parental chromosome types (PL or pl) is passed onto the offspring. These are called the parental gametes.

However, if a recombination event occurs, a PL/pl parent can generate Pl and pL recombinant chromosomes to pass onto its offspring.

11

Let c denote the recombination frequency --- the probability that a randomly-chosen gamete from the parent is of the recombinant type (i.e., it is not a parental gamete).

For a PL/pl parent, the gamete frequencies are

Gamete type Frequency Expectation under independent assortment

PL (1-c)/2 1/4

pl (1-c)/2 1/4

pL c/2 1/4

Pl c/2 1/4

12


Parental gametes in excess, as (1-c)/2 > 1/4 for c < 1/2

Recombinant gametes in deficiency, as c/2 < 1/4 for c < 1/2

13

Expected genotype frequencies under linkage

Suppose we cross PL/pl X PL/pl parents

What are the expected frequencies in their offspring?

Pr(PPLL) = Pr(PL|father)*Pr(PL|mother) = [(1-c)/2]*[(1-c)/2] = (1-c)^2/4

Likewise, Pr(ppll) = (1-c)^2/4

Recall from the previous data that freq(ppll) = 55/381 = 0.144

Hence, (1-c)^2/4 = 0.144, so 1-c = 2*sqrt(0.144) = 0.76, or c = 0.24
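The estimate of c can be reproduced numerically. This is a small sketch, assuming both parents are in coupling phase (PL/pl) so that freq(ppll) = (1-c)^2/4; the function name is hypothetical.

```python
import math


def estimate_c_from_double_recessive(n_double_recessive, n_total):
    """Solve (1 - c)^2 / 4 = freq(ppll) for c (both parents PL/pl)."""
    freq = n_double_recessive / n_total
    return 1 - 2 * math.sqrt(freq)


if __name__ == "__main__":
    # Bateson-Punnett data: 55 red, round offspring out of 381.
    c_hat = estimate_c_from_double_recessive(55, 381)
    print(round(c_hat, 2))
```

Only the double-recessive class is used here because its genotype is known from its phenotype; an estimate using all four phenotypic classes would need a likelihood approach.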

14

A (slightly) more complicated case

Again, assume the parents are both PL/pl. Compute Pr(PpLl)

Two situations, as PpLl could be PL/pl or Pl/pL

Pr(PL/pl) = Pr(PL|dad)*Pr(pl|mom) + Pr(PL|mom)*Pr(pl|dad) = [(1-c)/2]*[(1-c)/2] + [(1-c)/2]*[(1-c)/2] = (1-c)^2/2

Pr(Pl/pL) = Pr(Pl|dad)*Pr(pL|mom) + Pr(Pl|mom)*Pr(pL|dad) = (c/2)*(c/2) + (c/2)*(c/2) = c^2/2

Thus, Pr(PpLl) = (1-c)^2/2 + c^2/2

15

Generally, to compute the expected genotype probabilities, need to consider the frequencies of gametes produced by both parents.

Suppose dad = Pl/pL, mom = PL/pl

Pr(PPLL) = Pr(PL|dad)*Pr(PL|mom) = [c/2]*[(1-c)/2]

Notation: when PL/pl, we say that alleles P and L are in coupling

When parent is Pl/pL, we say that P and L are in repulsion
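The gamete bookkeeping for coupling and repulsion parents can be sketched in a few lines. The function names and the choice c = 0.24 (the Bateson-Punnett estimate) are assumptions made for this illustration.

```python
def gamete_freqs(phase, c):
    """Gamete frequencies for a double heterozygote.

    phase = "coupling" means PL/pl; phase = "repulsion" means Pl/pL.
    """
    parental, recombinant = (1 - c) / 2, c / 2
    if phase == "coupling":
        return {"PL": parental, "pl": parental, "Pl": recombinant, "pL": recombinant}
    if phase == "repulsion":
        return {"Pl": parental, "pL": parental, "PL": recombinant, "pl": recombinant}
    raise ValueError("phase must be 'coupling' or 'repulsion'")


def offspring_freq(gamete1, gamete2, dad_phase, mom_phase, c):
    """Probability that an offspring carries the unordered gamete pair."""
    dad, mom = gamete_freqs(dad_phase, c), gamete_freqs(mom_phase, c)
    prob = dad[gamete1] * mom[gamete2]
    if gamete1 != gamete2:
        # The same pair can arise with the parental roles swapped.
        prob += dad[gamete2] * mom[gamete1]
    return prob


if __name__ == "__main__":
    c = 0.24
    # Both parents in coupling: Pr(PPLL) = (1-c)^2 / 4
    print(offspring_freq("PL", "PL", "coupling", "coupling", c))
    # Dad in repulsion, mom in coupling: Pr(PPLL) = (c/2) * ((1-c)/2)
    print(offspring_freq("PL", "PL", "repulsion", "coupling", c))
    # Pr(PpLl) = Pr(PL/pl) + Pr(Pl/pL)
    print(offspring_freq("PL", "pl", "coupling", "coupling", c)
          + offspring_freq("Pl", "pL", "coupling", "coupling", c))
```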

In-class problems

•  Suppose c = 0.2
    –  In a cross of AB/ab X AB/ab, what is freq(AABB)?
    –  Suppose we cross AB/ab X Ab/aB. What is freq(AABB)?

•  Now suppose c is unknown, but in a cross of AB/ab X AB/ab, freq(AABB) = 0.25.
    –  What is c?

16

17

Linkage Disequilibrium

•  Under linkage equilibrium, the frequency of gametes is the product of allele frequencies
    –  e.g., Freq(AB) = Freq(A)*Freq(B)
    –  A and B are independent of each other

•  If the linkage phase of parents in some set or population departs from random (alleles not independent), linkage disequilibrium (LD) is said to occur

•  The amount D_AB of disequilibrium for the AB gamete is given by
    –  D_AB = Freq(AB gamete) - Freq(A)*Freq(B)
    –  D > 0 implies AB gamete more frequent than expected
    –  D < 0 implies AB less frequent than expected

18

Dynamics of D

•  Under random mating in a large population, allele frequencies do not change. However, gamete frequencies do if there is any LD

•  The amount of LD is reduced by a factor of (1-c) each generation
    –  D(t) = (1-c)^t D(0)

•  The expected frequency of a gamete (say AB) is
    –  Freq(AB) = Freq(A)*Freq(B) + D
    –  Freq(AB in gen t) = Freq(A)*Freq(B) + (1-c)^t D(0)
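A short numerical sketch of the definition of D and its decay under random mating follows. The function names and the example frequencies (Freq(AB) = 0.4, Freq(A) = Freq(B) = 0.5, c = 0.1) are made up for illustration.

```python
import math


def disequilibrium(freq_ab, freq_a, freq_b):
    """D_AB = Freq(AB gamete) - Freq(A) * Freq(B)."""
    return freq_ab - freq_a * freq_b


def ld_after(d0, c, t):
    """D(t) = (1 - c)^t * D(0) under random mating."""
    return (1 - c) ** t * d0


def gamete_freq_after(freq_a, freq_b, d0, c, t):
    """Expected Freq(AB) in generation t."""
    return freq_a * freq_b + ld_after(d0, c, t)


def generations_to_reach(fraction, c):
    """Generations (possibly fractional) until D(t)/D(0) equals `fraction`."""
    return math.log(fraction) / math.log(1 - c)


if __name__ == "__main__":
    d0 = disequilibrium(0.4, 0.5, 0.5)
    print(d0, ld_after(d0, c=0.1, t=5), generations_to_reach(0.5, c=0.1))
```

Note how slowly LD decays for tightly linked loci: with small c, (1-c)^t stays close to one for many generations.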

19

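[Figure: a population with equal numbers of AB/ab and Ab/aB parents. Because of linkage, AB/ab parents give an excess of parental gametes AB, ab, while Ab/aB parents give an excess of parental gametes Ab, aB.]

Pool all gametes: AB, ab, Ab, aB equally frequent

No LD: random distribution of linkage phases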

20

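[Figure: a population in which AB/ab parents outnumber Ab/aB parents (three AB/ab and one Ab/aB shown). Because of linkage, each parent gives an excess of its parental gametes.]

Pool all gametes: Excess of AB, ab due to an excess of AB/ab parents

With LD, nonrandom distribution of linkage phase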

21

Molecular Markers

SNP -- single nucleotide polymorphism. A particular position on the DNA (say base 123,321 on chromosome 1) that has two different nucleotides (say G or A) segregating

STR -- short tandem repeat. An STR locus consists of a number of short repeats, with alleles defined by the number of repeats. For example, you might have 6 and 4 copies of the repeat on your two chromosome 7s

In the molecular era, genetic maps are based not on alleles with large phenotypic effects (i.e., green vs. yellow peas), but rather on molecular markers

Even with whole-genome sequencing, sites are still classified into these two classes (plus other types)

22

SNPs vs STRs

SNPs

Cons: Less polymorphic (at most 2 alleles)

Pros: Low mutation rates, alleles very stable
      Excellent for looking at historical long-term associations (association mapping)
      Cheap to score 100,000s (+) on a single SNP chip

STRs

Cons: High mutation rate

Pros: Very highly polymorphic (more information/site)
      Excellent for linkage studies within an extended pedigree (QTL mapping in families or pedigrees)

23

Genetic maps

•  Published genetic maps give the distances between molecular markers along a chromosome in terms of map units (m, expected number of crossovers between them), rather than their recombination frequencies c. Why?
    –  c is not additive over loci, while m is
    –  Hence, m is a more natural metric
    –  Transition from an observed c to an estimated m requires a mapping function, which requires an assumption about how interference works

24

Genetic Maps and Mapping Functions

The observable measure of genetic distance between two markers is the recombination frequency, c

If the phase of a parent is AB/ab, then 1-c is the frequency of “parental” gametes (e.g., AB and ab), while c is the frequency of “nonparental” gametes (e.g., Ab and aB).

A parental gamete results from an EVEN number of crossovers between A & B, e.g., 0, 2, 4, etc.

For a nonparental (also called a recombinant) gamete, need an ODD number of crossovers between A & B, e.g., 1, 3, 5, etc.

25

Hence, simply using the frequency of “recombinant” (i.e., nonparental) gametes UNDERESTIMATES the expected number m of crossovers, with m > c

Mapping functions attempt to estimate the expected number of crossovers m from observed recombination frequencies c

When considering two linked loci, the phenomenon of interference must be taken into account

The presence of a crossover in one interval typically decreases the likelihood of a nearby crossover

In particular, c = Prob(odd number of crossovers)

26

Suppose the order of the genes is A-B-C.

If there is no interference (i.e., crossovers occur independently of each other) then

c_AC = Probability(odd number of crossovers between A and C)

     = Pr(odd number in A-B, even number in B-C) + Pr(even number in A-B, odd number in B-C)

     = c_AB (1 - c_BC) + (1 - c_AB) c_BC = c_AB + c_BC - 2 c_AB c_BC

27

We need to assume independence of crossovers in order to multiply these two probabilities

When interference is present, we can write this as

c_AC = c_AB + c_BC - 2(1-δ) c_AB c_BC

δ = interference parameter

δ = 1 --> complete interference: The presence of a crossover eliminates nearby crossovers

δ = 0 --> No interference. Crossovers occur independently of each other
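The non-additivity of recombination fractions is easy to see numerically. This sketch assumes gene order A-B-C and uses the interference parameter δ as defined above; the function name and the example values are hypothetical.

```python
def recombination_ac(c_ab, c_bc, delta=0.0):
    """c_AC = c_AB + c_BC - 2 (1 - delta) c_AB c_BC for gene order A-B-C.

    delta = 0: no interference; delta = 1: complete interference.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie between 0 and 1")
    return c_ab + c_bc - 2 * (1 - delta) * c_ab * c_bc


if __name__ == "__main__":
    print(recombination_ac(0.1, 0.2))             # no interference
    print(recombination_ac(0.1, 0.2, delta=1.0))  # complete interference
```

Only under complete interference do the two recombination fractions simply add; otherwise c_AC is smaller than c_AB + c_BC.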

28

Mapping functions. Moving from c to m

Haldane’s mapping function (gives Haldane map distances)

Assume the number k of crossovers in a region follows a Poisson distribution with parameter m

This makes the assumption of NO INTERFERENCE

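Pr(Poisson = k) = λ^k exp(-λ)/k!, where λ = expected number of successes

With p(m; k) the Poisson probability of k crossovers when the expected number is m,

$$c = \sum_{k=0}^{\infty} p(m; 2k+1) = e^{-m} \sum_{k=0}^{\infty} \frac{m^{2k+1}}{(2k+1)!} = \frac{1 - e^{-2m}}{2}$$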

29

Here c = Prob(odd number of crossovers), the Poisson probabilities summed over the odd numbers of crossovers. Solving c = (1 - e^{-2m})/2 for m gives

$$m = -\frac{\ln(1 - 2c)}{2}$$

Relates recombination fraction c to expected number of crossovers m

Usually reported in units of Morgans or centiMorgans (cM)

One Morgan --> m = 1.0. One cM --> m = 0.01
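Haldane's mapping function and its inverse can be sketched directly from the formulas above. The function names are made up for this illustration, and the Poisson sum is included only as a numerical check of the closed form.

```python
import math


def haldane_c_from_m(m):
    """Recombination fraction for a map distance of m Morgans (no interference)."""
    return 0.5 * (1 - math.exp(-2 * m))


def haldane_m_from_c(c):
    """Haldane map distance (Morgans) for a recombination fraction c < 1/2."""
    if not 0 <= c < 0.5:
        raise ValueError("c must satisfy 0 <= c < 0.5")
    return -0.5 * math.log(1 - 2 * c)


def poisson_odd_probability(m, terms=30):
    """Pr(odd number of crossovers) by summing Poisson probabilities directly."""
    return sum(
        math.exp(-m) * m ** (2 * k + 1) / math.factorial(2 * k + 1)
        for k in range(terms)
    )


if __name__ == "__main__":
    m = haldane_m_from_c(0.24)
    print(f"c = 0.24 corresponds to {100 * m:.1f} cM")
    print(haldane_c_from_m(0.5), poisson_odd_probability(0.5))
```

For small distances m and c nearly coincide, while c approaches 1/2 as m grows, which is why map distances rather than recombination fractions are added along a chromosome.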

30

Organelle genetics

•  With autosomal loci, each parent contributes an equal number of chromosomes
•  However, the mitochondrial and chloroplast genomes are only passed from the mother.
    –  While these have a small number of genes (mtDNA ~ 20, cpDNA ~ 50-100), they can still have phenotypic effects
    –  Example: cytoplasmic sterility factors on mtDNA used in maize to avoid having to detassel the seed-parent (female) plants

31

Systems of matings and types of crosses

•  Types of crosses
    –  F1, F2, Backcrosses
    –  Fk, Advanced intercross (AIC) lines
    –  Isogenic/inbred lines
        •  Recombinant inbred lines (RILs)
    –  Selfing
        •  Sk lines
    –  Doubled haploids

32

Backcross design

P1 x P2 --> F1
B1 = F1 x P1
B2 = F1 x P2

B1(2) = B1 x P1. Repeated backcrossing to P1 gives B1(k) = B1(k-1) x P1 lines

Fraction genetic contributions from each parent

Cross    P1                 P2

F1       50%                50%

B1       75%                25%

B2       25%                75%

B1(k)    1 - (1/2)^(k+1)    (1/2)^(k+1)

B2(k)    (1/2)^(k+1)        1 - (1/2)^(k+1)
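The backcross table can be generated for any number of backcross generations. This is a minimal sketch of the B1(k) row, with a hypothetical function name; B2(k) is obtained by swapping the two parents.

```python
from fractions import Fraction


def backcross_contributions(k):
    """Expected genome fractions in B1(k): (recurrent parent P1, donor parent P2)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    donor = Fraction(1, 2) ** (k + 1)
    return 1 - donor, donor


if __name__ == "__main__":
    for k in range(1, 6):
        recurrent, donor = backcross_contributions(k)
        print(f"B1({k}): P1 = {float(recurrent):.4f}, P2 = {float(donor):.4f}")
```

These are expectations averaged over the genome; individual backcross progeny vary around them.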

33

F2-based crosses

•  Randomly mating the F1 generates the F2
    –  These can also be generated by selfing each F1
•  Isogenic (or inbred) lines are created by taking a set of F1 individuals and selfing each for 5-10 generations to create a series of inbred lines
    –  Generates a series of pure lines that capture some of the initially segregating variation
    –  When generated following a cross, also called RILs
•  Advanced intercross lines (AIC) are created by randomly mating the F2 line for multiple generations (AIC(k) = Fk = k generations of random mating)
    –  Has the effect of expanding the genetic map: in the AIC(k), the recombination rate between two markers is ~ c*k

34

Selfing, Doubled Haploids

•  If two inbred lines are crossed, all of the F1 are heterozygotes. If we self the F1 for k generations, then the fraction of loci that are heterozygotes is (1/2)^k.
    –  Less than 1% after k = 7 generations of selfing, about 0.1% after k = 10
    –  Sk lines refer to k generations of selfing an F2, e.g., with only selfing Sk = F(k+2)
    –  S0 = the F2 from selfing an F1 line
•  Doubled haploids, DH, (the doubling of a haploid set in a gamete) produce fully inbred individuals in one generation
    –  DH lines capture most of the initial LD (only a single generation of recombination)
    –  Selfing-generated lines further decay some LD, but not as efficiently as random mating.
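The halving of heterozygosity under selfing can be tabulated quickly. The sketch below counts k as the number of selfing generations applied to the fully heterozygous F1; both function names are hypothetical.

```python
import math


def heterozygosity_after_selfing(k):
    """Expected fraction of heterozygous loci after k generations of selfing an F1."""
    return 0.5 ** k


def selfing_generations_needed(target):
    """Smallest k with expected heterozygosity at or below `target`."""
    return math.ceil(math.log(target) / math.log(0.5))


if __name__ == "__main__":
    for k in (1, 5, 7, 10):
        print(k, f"{100 * heterozygosity_after_selfing(k):.3f}%")
    print(selfing_generations_needed(0.01))
```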

35

Selfing and Favorable Alleles

•  Suppose inbred lines 1 and 2 are each fixed for five favorable alleles not found in the other (10 loci in total).
    –  In the F1, all individuals carry at least one favorable allele at each locus
    –  In the F2, the probability a locus contains at least one favorable allele is Pr(favorable homozygote) + Pr(heterozygote) = (1/4) + (1/2) = 3/4.
        •  Pr(all 10 loci do) = (3/4)^10 = 0.056
    –  If fully inbred, Pr(at least one favorable allele at a locus) = 1/2
        •  Pr(all 10 loci fixed for favorable allele) = (1/2)^10 = 1/1024
        •  Roughly 57 times less likely than in an F2.
        •  Hence, while inbred lines build up loci fixed for both favorable alleles, they have fewer loci with favorable alleles than an F1 or F2.

36

Effects of selection

•  Now suppose that selection increases the frequency of each favorable allele from 0.5 to 0.9
    –  For an inbred, now Pr(all loci fixed for favorable alleles) = 0.9^10 = 0.35
    –  For a random-mating population, Pr(all loci contain a favorable allele) = (1-0.1^2)^10 = 0.904
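The inbred versus random-mating comparison, before and after selection, can be reproduced with one small function. It is a sketch assuming independent loci that all share the same favorable-allele frequency p; the function name is made up.

```python
def prob_all_loci_favorable(p, n_loci, inbred):
    """Pr(every locus carries at least one favorable allele).

    Fully inbred line: a locus is fixed for the favorable allele with probability p.
    Random mating: a locus lacks the favorable allele with probability (1 - p)^2.
    """
    per_locus = p if inbred else 1 - (1 - p) ** 2
    return per_locus ** n_loci


if __name__ == "__main__":
    for p in (0.5, 0.9):
        inbred = prob_all_loci_favorable(p, 10, inbred=True)
        outbred = prob_all_loci_favorable(p, 10, inbred=False)
        print(f"p = {p}: inbred {inbred:.4f}, random mating {outbred:.4f}")
```

With p = 0.5 the random-mating value is the F2 case, (3/4)^10.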

37

Types of Gene Action

•  At a single gene, we can see dominance
    –  The heterozygote has a phenotype that is different from the average of the two homozygotes
    –  Interaction between the two alleles at a locus
•  We can also have pleiotropy, where a single gene influences two or more traits.
•  When two (or more) genes influence the same trait, the possibility of epistasis exists
    –  The two-locus phenotype is not simply the sum of the two single-locus phenotypes

38

Epistasis

•  Consider the two-locus genotype AiAjBkBl
•  Let Gij.. = Gij denote the average deviation between an AiAj individual and the population mean, same for Gkl
•  If Gijkl = u + Gij + Gkl, i.e., the two-locus genotypic value is the sum of the single-locus genotypic values (based on deviations from the mean u), then we say genotypic values are additive across loci (while dominance might still occur at either locus)
    –  If this is NOT the case, we say that epistasis occurs --- the two-locus genotypic value departs from the sum of the contributions of the two single loci.
    –  Dominance = interaction between alleles at the SAME locus
    –  Epistasis = interaction between alleles at DIFFERENT loci

39

Example

Two-locus genotypic values:

       AA    Aa    aa

BB     10    15    20

Bb     10    15    20

bb      5    10    15

Single-locus contributions: B locus BB = 10, Bb = 10, bb = 5; A locus AA = 0, Aa = 5, aa = 10

B is dominant to b, A is additive (no dominance). However, no epistasis, as the phenotypic value is (B phenotype) + 5*(# of a alleles), namely the sum of the two genotypic values at each locus
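Whether a two-locus table shows epistasis can be checked mechanically: the values are additive across loci exactly when every interaction contrast is zero. The sketch below applies this to the table above; the function name and the second, epistatic table are made up for illustration.

```python
def has_epistasis(table, tol=1e-12):
    """True if a two-locus table of genotypic values is not additive across loci.

    table[i][j] is the value for B-locus genotype i and A-locus genotype j.
    Additivity across loci means table[i][j] = row effect + column effect,
    i.e. every contrast against the first row and first column is zero.
    """
    base = table[0][0]
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            if abs(value - table[i][0] - table[0][j] + base) > tol:
                return True
    return False


if __name__ == "__main__":
    # Rows: BB, Bb, bb. Columns: AA, Aa, aa (values from the slide).
    slide_table = [[10, 15, 20], [10, 15, 20], [5, 10, 15]]
    # Hypothetical table in which the a allele has no effect in bb plants.
    epistatic_table = [[10, 15, 20], [10, 15, 20], [5, 5, 5]]
    print(has_epistasis(slide_table), has_epistasis(epistatic_table))
```

Dominance at either locus does not trigger the check, since it only changes the row or column effects.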

Lecture 2b: The Genetic Determinants of Size in Plants, Animals, and Humans

Mike Gore lecture notes
Tucson Plant Breeding Institute

Module 1

1

• Pea and corn
• Dog
• Human


2

Height adaptations are essential to plant fitness and agricultural performance

• Natural populations: natural selection for plant height to improve light interception, carbon and nutrient capture, weed competition, and seed dispersal

• Breeding populations: artificial selection for plant height to increase harvest uniformity, favorably partition carbon and nutrients, and enhance input use efficiency

Peiffer et al. 2014. Genetics 196:1337-1356

3

Father of the Green Revolution: Borlaug developed high-yielding, short-stature wheat at the International Maize and Wheat Improvement Center (CIMMYT, MX) in the 1960s

Starts work in India and Pakistan

The fertilizer-responsive wheat varieties he developed, grown in Latin America and Asia, saved millions, if not a billion, people from starvation. Received the 1970 Nobel Peace Prize.

http://rationalwiki.org/wiki/Norman_Borlaug

4

Mendel’s Peas 1866

All of Mendel’s seven traits in pea were controlled by single genes that showed independent assortment

http://www.ck12.org/book/CK-12-Biology-Concepts/r11/section/3.1/

5

Le (stem length) gene controls internode elongation

Le/Le: tall
le/le: dwarf

Le encodes gibberellin 3β-hydroxylase, which converts GA20 to bioactive GA1

Reduced activity of le is from an alanine-to-threonine substitution in the active site of the enzyme

Lester et al. 1997. Plant Cell 9:1435-1443

6

Since Mendel’s experiments, plant height has continued to be studied because of its high heritability and ease of measurement in plant populations

nps.gov/history/history/online_books/science/8/chap5.htm

1974 – Steel measuring tape
2014 – Barcoded measuring tape

7

More than 40 genes at which mutations have large effects on height have been identified in maize

These 40 genes are mostly involved in hormone (e.g., auxin, gibberellin and brassinosteroid) synthesis, transport, and signaling

“breakdown”“biosynthesis”

Salas Fernandez et al. 2009. Trends Plant Sci. 14:454-461


Evolutionary models predict that loss of unfavorable large-effect alleles is likely as a population approaches optimal fitness/productivity in agricultural systems

Peiffer et al. 2014. Genetics 196:1337-1356

• Maize brachytic2 (br2) mutants have compact lower stalk internodes

• The height reduction results from the loss of a P-glycoprotein that modulates polar auxin transport in the maize stalk

Multani et al. 2003. Science 302:81-84

9

In total, 4892 NAM and IBM RILs were scored for PHT in >7 environments

Joint-linkage mapping of QTL for plant height (PHT) in the maize Nested Association Mapping (NAM) panel


Wide range in phenotypic variation for PHT (cm) across families, with transgressive segregation in all families

High heritability (H2) across and within NAM families

Peiffer et al. 2014. Genetics 196:1337-1356

11

The joint-linkage QTL models identified 35 QTL that explained ~76% of PHT variation

Largest effect locus: 2.1 ± 0.9% of PHT variation

The QTL effect estimates (4-6 cm) were validated by fine mapping in two near-isogenic line families

The intervals contained >100 genes, with no obvious candidates


JL-assisted GWAS with ~30 million SNPs: Resolving the identified QTL associated with height in maize

Peiffer et al. 2014. Genetics 196:1337-1356

The molecular basis of natural PHT variation in maize remains largely elusive

Variation in PHT is well explained by Fisher’s infinitesimal model of genetic architecture

3 out of > 120 candidates for height loci

13



14

Domestic dog breeds exhibit tremendous diversity in body size

http://hondentrimsalonscissors.nl/

Domestic dog originated from the gray wolf 15,000 years ago, but most breeds are only a few hundred years old

15

Within a dog breed: Fine mapping of a major QTL for body size on chr. 15 in Portuguese water dog (PWD)

Sutter et al. 2007. Science 316: 112–115

n = 463

Insulin-like growth factor 1 (IGF1) is known to influence body size in mice and humans

http://www.dogbreedinfo.com/portuguesewaterdog.htm

16


In PWD, 15% of the phenotypic variance in skeletal size is explained by the IGF1 haplotype

In PWD, 96% of chromosomes carry only one of two haplotypes –B haplotype confirmed smaller skeletal size

http://www.dogbreedinfo.com/portuguesewaterdog.htm

17

Association mapping of body weight in 14 small and 9 giant dog breeds identified IGF1

[Figure: Mann-Whitney U (MWU) P-values for 116 SNPs and for 83 SNPs on a control chromosome; IGF1 marked]


Yellow – ancestral allele (golden jackal)
Blue – derived allele

The B haplotype at IGF1 is most associated with small body size


Strong linkage disequilibrium among variants prevented the identification of a causative variant

Small n = 14 Large n = 9

19

The IGF1 small dog haplotype is derived from Middle Eastern gray wolves

http://www.redorbit.com/education/reference_library/animal_kingdom/mammalia/2580178/southerneast_asian_wolf/

A few major QTL appear to control body size in dogs – result of selection for novelty and bottleneck

20



21

Human height is a classic polygenic trait with more than 80% of the variation within a given population estimated to be attributable to additive genetic factors

“Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” - Sir Francis Galton, Natural Inheritance, 1889, describing what is now known as the central limit theorem

http://terrytao.wordpress.com/2010/09/14/a-second-draft-of-a-non-technical-article-on-universality/

n = 143

~165 cm (5.41 ft) for females
~178 cm (5.84 ft) for males

Visscher 2008. Nat. Genet. 40:489-490

22

Mutations that cause extreme stature are rare and highly unlikely to explain natural variation in human height

http://www.ibtimes.co.uk/worlds-tallest-man-meets-shortest-guinness-world-records-day-2014-makes-wacky-additions-almanac-1474652

251 cm (8.23 ft)

54.6 cm (1.79 ft)

23

A GWAS of height in a population of 183,727 individuals identified 180 SNP loci with small additive effects (stage 1)

Lango Allen et al. 2010. Nature 467:832 - 838

Enrichment of signals at genes in biological pathways and that underlie skeletal growth defects

Proportion of variance explained by 180 SNP loci in separate pops (stage 2)

FINGERSTURE    0.08 ± 0.02
RS2            0.11 ± 0.01
RS3            0.11 ± 0.01
GOOD           0.09 ± 0.02
QIMR           0.11 ± 0.02

24

Yang et al. 2010. Nat. Genet. 42: 565–569

A genome-wide prediction of height in a population of 3,935 individuals with 294,831 common SNPs that together explain 45-54% of phenotypic variance

The remaining heritability is unexplained, likely because of incomplete LD between the SNPs and causal variants that have small effects and low MAF

(adjustment for prediction error- incomplete LD between SNPs and causal variants)

25

A GWAS meta-analysis of 78 height studies on 253,288 individuals identified sets of common SNPs (0.1-1% of total) that explained ~16-29% of phenotypic variance

Genetic architecture for height is defined by a very large finite number (thousands) of causal variants

Wood et al. 2014. Nat. Genet. 46:1173-1186

Validation in five independent populations excluded from meta‐analysis

All common SNPs explain even more!

26

Wood et al. 2014. Nat. Genet. 46:1173-1186

Genes at 423 loci tended to be highly expressed in tissues related to cartilage, joints and spine and other musculoskeletal, cardiovascular and endocrine tissues

Tissue enrichment combined with pruned gene set network 37,427 human microarray samples

Growth genes

27

1

Lecture 3: Basic Probability and Statistical Tools

Bruce Walsh Notes
Introduction to Plant Quantitative Genetics
Tucson, 8-10 Jan 2018

I: Probability

2

3

Basic probability

•  Events are possible outcomes from some random process
    –  e.g., a genotype is AA, a phenotype is larger than 20
•  Pr(E) denotes the probability of an event E
•  Pr(E) is between zero and one
•  The sum of the probabilities of all possible nonoverlapping events is one.
    –  e.g., if the possible events are E1, … , Ek, then
    –  Pr(E1) + … + Pr(Ek) = 1

4

The AND rule

•  Consider two possible events, E1 and E2.
•  If these are independent (knowledge that one has occurred does not change the probability of the second), then the joint probability Pr(E1,E2), the probability of E1 AND E2, is Pr(E1,E2) = Pr(E1)*Pr(E2)
•  Hence, with independence, AND = multiply
•  Conditional probability is used when the events are NOT independent

5

Example

•  Consider the cross AaBbCc X aaBbCc
    –  What is the probability of an aabbcc offspring?
    –  Assuming independent assortment (no linkage)
    –  = Pr(aa | Aa x aa) * Pr(bb | Bb x Bb) * Pr(cc | Cc x Cc) = (1/2)(1/4)(1/4) = 1/32
•  How many offspring do we need to score to have a 90% probability of seeing at least one?
    –  Let p = 1/32. Prob(not seeing aabbcc in n offspring) = (1-p)^n.
    –  Prob(at least one) = 0.9 implies Prob(none) = 0.1
    –  (1-p)^n = 0.1, or n = log(0.1)/log(1-1/32) = 72.5, so 73 offspring must be scored
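The sample-size calculation generalizes to any genotype probability and any desired chance of success. A minimal sketch, with a hypothetical function name, assuming offspring are independent:

```python
import math


def offspring_needed(p, prob_at_least_one):
    """Smallest n with 1 - (1 - p)^n >= prob_at_least_one."""
    if not 0 < p < 1 or not 0 < prob_at_least_one < 1:
        raise ValueError("probabilities must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - prob_at_least_one) / math.log(1 - p))


if __name__ == "__main__":
    p = (1 / 2) * (1 / 4) * (1 / 4)  # Pr(aabbcc) = 1/32
    print(offspring_needed(p, 0.90))
    print(offspring_needed(p, 0.99))
```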

6

The OR rule

•  Again, consider two possible events, E1 and E2.
•  If these events are NONOVERLAPPING (they contain no common elements), then Pr(E1 or E2) = Pr(E1) + Pr(E2)
•  Hence, OR = add
•  Example:
    –  What is the probability that a genotype is A-, i.e., that it is AA or Aa?
    –  The events genotype = AA and genotype = Aa are nonoverlapping
    –  Hence, Pr(A-) = Pr(AA or Aa) = Pr(AA) + Pr(Aa)

7

Conditional Probability
•  It is ALWAYS true that
   –  Pr(A,B) = Pr(A|B)Pr(B) = Pr(B|A)Pr(A)
   –  Pr(A|B) is the conditional probability of A given B
   –  Pr(A) is the marginal probability of A
   –  Pr(A,B) is the joint probability of A and B
   –  If Pr(A|B) = Pr(A) for all possible B values, then A and B are independent
•  Note that
   –  Pr(A|B) = Pr(A,B)/Pr(B)

8

Examples of Prob (cont)
•  Recall that yellow peas (Y-) are dominant to green peas (gg). Consider the F2 in a cross of YY x gg.
   –  What is the probability of a yellow F2 offspring?
      •  Pr(yellow) = Pr(YY or Yg) = Pr(YY) + Pr(Yg) = 1/4 + 1/2 = 3/4
   –  What is the probability that a yellow F2 offspring is a YY homozygote?
      •  Pr(YY | F2 yellow) = Pr(YY and F2 yellow)/Pr(F2 yellow) = (1/4)/(3/4) = 1/3.

9

Bayes’ Theorem

Suppose an unobservable random variable (RV) takes on values b1, …, bn

Suppose that we observe the outcome A of an RV correlated with b. What can we say about b given A?

Bayes’ theorem:

$$\Pr(b_j \mid A) = \frac{\Pr(b_j)\,\Pr(A \mid b_j)}{\Pr(A)} = \frac{\Pr(b_j)\,\Pr(A \mid b_j)}{\sum_{i=1}^{n} \Pr(b_i)\,\Pr(A \mid b_i)}$$

A typical application in genetics is that A is some phenotype and b indexes some underlying (but unknown) genotype

Example
•  You have an F2 plant that gives yellow peas from a pure yellow (YY) x pure green (gg) cross (green recessive)
   –  Hence, your plant is Y-, but could be YY or Yg.
   –  You test-cross this plant by crossing to a gg parent.
   –  If the plant is YY, expect only yellow offspring. If Yg, expect 50:50 yellow/green
•  If you score 5 offspring and all are yellow, what is the probability this plant is YY?

10

•  You want to compute
   –  Pr(F2 yellow is YY | 5 yellow offspring)
•  First, you need your prior
   –  Prob(F2 yellow is YY) = 1/3
   –  Prob(F2 yellow is Yg) = 2/3
•  Second,
   –  Prob(5 yellow offspring | YY) = 1
   –  Prob(5 yellow offspring | Yg) = $(1/2)^5$
•  Note that Pr(5 yellow offspring) = Prob(5 yellow offspring | YY) * Pr(YY) + Prob(5 yellow offspring | Yg) * Pr(Yg) = $1 \cdot (1/3) + (1/2)^5 \cdot (2/3) = 0.3542$
•  From Bayes
   –  Pr(F2 yellow is YY | 5 yellow offspring)
      •  = Prob(5 yellow offspring | YY) * Pr(YY) / Pr(5 yellow offspring)
      •  = 1*(1/3)/0.3542 = 0.941
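The test-cross calculation is an instance of Bayes' theorem over a discrete set of hypotheses. As a hedged sketch (the helper `posterior` and the dictionary layout are illustrative choices, not part of the notes), it can be written as:

```python
def posterior(priors, likelihoods):
    """Bayes' theorem for a discrete set of hypotheses.

    priors[h] = Pr(h); likelihoods[h] = Pr(data | h).
    """
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())          # Pr(data), the normalizing constant
    return {h: joint[h] / total for h in joint}

# Yellow F2 plant: prior genotype probabilities given the plant is Y-
priors = {"YY": 1 / 3, "Yg": 2 / 3}
# Probability of 5 yellow test-cross offspring under each genotype
likelihoods = {"YY": 1.0, "Yg": 0.5 ** 5}

post = posterior(priors, likelihoods)
print(round(post["YY"], 3))  # 0.941
```

The same function applies to any example where genotype frequencies act as priors and Pr(phenotype | genotype) as likelihoods.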

11

12

Second example: Suppose height > 70. What is the probability that the individual is QQ? Qq? qq?

Suppose:

Genotype                     QQ    Qq    qq
Freq(genotype)               0.5   0.3   0.2
Pr(height > 70 | genotype)   0.3   0.6   0.9

Pr(height > 70) = 0.3*0.5 + 0.6*0.3 + 0.9*0.2 = 0.51

Pr(QQ | height > 70) = Pr(QQ) * Pr(height > 70 | QQ) / Pr(height > 70) = 0.5*0.3 / 0.51 = 0.294

2. Probability distributions and random variables

13

14

Discrete Random Variables

A random variable (RV) = an outcome (realization) that is not a set value, but rather is drawn from some probability distribution

A discrete RV x takes on values X1, X2, …, Xk

Probability distribution: Pi = Pr(x = Xi)

Pi ≥ 0, Σ Pi = 1

Probabilities are non-negative and sum to one

Example: Suppose the probability of seeing no individuals of genotype AABB in our sample is 0.1. What is the probability of seeing at least one? Pr(none) + Pr(at least one) = 1, hence Pr(at least one) = 1 - Pr(none) = 0.9

15

The Binomial Distribution
•  What is the probability of seeing k successes in a series of n trials, where the probability p of success is the same for each trial?
•  This is given by the binomial distribution,
   –  Pr(k successes | n, p) = $\frac{n!}{(n-k)!\,k!}\, p^k (1-p)^{n-k}$
•  Example: Suppose p = 0.05 and n = 10. What is the probability of seeing EXACTLY one success?
   –  Pr(k = 1) = $\frac{10!}{9!\,1!}\,(0.05)^1 (0.95)^9 = 10 \cdot (0.05)^1 (0.95)^9 = 0.315$
•  What is the probability of seeing AT LEAST one success?
   –  Pr(k > 0) = 1 - Pr(k = 0) = $1 - (1 - 0.05)^{10} = 0.401$

16

The Poisson Distribution
•  Given that the expected number of successes in our sample is λ, what is the probability that we see k successes?
•  This is given by the Poisson distribution
   –  Pr(k successes | λ) = $e^{-\lambda} \lambda^k / k!$
•  Example: suppose λ = 0.5.
   –  Pr(k = 1) = $e^{-0.5}\,(0.5)^1 / 1! = 0.303$
   –  Pr(at least one success) = 1 - Pr(k = 0) = $1 - e^{-0.5} = 0.393$
•  Connection with binomial: λ = n*p
   –  Can use the Poisson either as an approximation or when the sample size n is not given

17

The geometric distribution
•  Given success probability p per trial, how many failures k occur before the first success?
•  This is a waiting-time (as opposed to a counting) problem, and is given by the geometric distribution
   –  Pr(k failures before a success) = $(1-p)^k\, p$
   –  Example: Suppose p = 0.05. What is the probability of AT LEAST one success in the first 10 trials?
   –  = 1 - Pr(none in 1st 10) = $1 - (1-p)^{10} = 0.401$
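The worked values for the binomial, Poisson, and geometric distributions can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """Pr(k successes in n trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Pr(k successes when the expected number is lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def geometric_pmf(k, p):
    """Pr(k failures before the first success)."""
    return (1 - p) ** k * p

print(round(binomial_pmf(1, 10, 0.05), 3))       # 0.315
print(round(1 - binomial_pmf(0, 10, 0.05), 3))   # 0.401
print(round(poisson_pmf(1, 0.5), 3))             # 0.303
print(round(1 - poisson_pmf(0, 0.5), 3))         # 0.393
# At least one success in the first 10 trials = at most 9 failures first
print(round(sum(geometric_pmf(k, 0.05) for k in range(10)), 3))  # 0.401
```

The last line shows that the waiting-time and counting views agree: summing the geometric probabilities for 0 through 9 failures gives the same 0.401 as the binomial "at least one" calculation.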

Continuous Random Variables

A continuous RV x can take on any possible value in some interval (or set of intervals). The probability distribution is defined by the probability density function (or pdf), p(x), where

$p(x) \ge 0$, $\int p(x)\,dx = 1$, and $\Pr(x_1 \le x \le x_2) = \int_{x_1}^{x_2} p(x)\,dx$

Finally, the cdf, or cumulative probability function, is defined as cdf(z) = Pr(x ≤ z) = $\int_{-\infty}^{z} p(x)\,dx$

19

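Example: The normal (or Gaussian) distribution

Mean µ, variance σ²: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

Unit normal (mean 0, variance 1): $p(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$

20

Mean (µ) = peak of distribution

The variance is a measure of spread about the mean. The smaller σ², the narrower the distribution about the mean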

3. Expectations and descriptive statistics

21

22

Expectations of Random Variables

The expected value, E[f(x)], of some function f of the random variable x is just the average value of that function:

$E[f(x)] = \sum_i \Pr(x = X_i)\, f(X_i)$ for a discrete RV, and $E[f(x)] = \int f(x)\, p(x)\, dx$ for a continuous RV

E[x] = the (arithmetic) mean, µ, of a random variable x

23

Expectations of Random Variables

E[(x - µ)²] = σ², the variance of x

More generally, the rth moment about the mean is given by $E[(x - \mu)^r]$

r = 2: variance (σ²)

r = 3: skew (value is zero for a normal)

r = 4: (scaled) kurtosis ($3\sigma^4$ for a normal)

Useful properties of expectations

$E[g(x) + f(y)] = E[g(x)] + E[f(y)]$

$E[c\,x] = c\,E[x]$ for a constant c

24

Covariances
•  Cov(x,y) = E[(x - µx)(y - µy)]
•  = E[x*y] - E[x]*E[y]

Cov(x,y) > 0, positive (linear) association between x & y

[Figure: scatterplot of Y against X, cov(X,Y) > 0]

25

Cov(x,y) < 0, negative (linear) association between x & y

[Figure: scatterplot of Y against X, cov(X,Y) < 0]

Cov(x,y) = 0, no linear association between x & y

[Figure: scatterplot of Y against X, cov(X,Y) = 0]

26

Cov(x,y) = 0 DOES NOT imply no association

[Figure: scatterplot of Y against X, cov(X,Y) = 0]

If x and y are independent, then cov(x,y) = 0

However, cov(x,y) = 0 DOES NOT imply that x and y are independent.
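A small constructed example makes the last point concrete. In this sketch (the data and helper names are hypothetical), y is completely determined by x, yet the covariance is exactly zero because the relationship is symmetric rather than linear.

```python
def mean(values):
    return sum(values) / len(values)

def cov(xs, ys):
    """Population covariance, E[xy] - E[x]E[y], with equal weight on each pair."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

xs = [-2, -1, 0, 1, 2]
ys = [x**2 for x in xs]   # y is a deterministic function of x

print(cov(xs, ys))        # 0.0: no LINEAR association
print(cov(xs, xs))        # 2.0: Cov(x, x) = Var(x)
```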

27

Correlation

Cov = 10 tells us nothing about the strength of an association

What is needed is an absolute measure of association (one that does not depend on the units of x and y)

This is provided by the correlation, r(x,y):

$r(x,y) = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$

r = 1 implies a perfect (positive) linear association

r = -1 implies a perfect (negative) linear association

28

Useful Properties of Variances and Covariances
•  Symmetry, Cov(x,y) = Cov(y,x)
•  The covariance of a variable with itself is the variance, Cov(x,x) = Var(x)
•  If a is a constant, then
   –  Cov(ax,y) = a Cov(x,y)
•  Var(ax) = a² Var(x)
   –  Var(ax) = Cov(ax,ax) = a² Cov(x,x) = a² Var(x)
•  Cov(x+y,z) = Cov(x,z) + Cov(y,z)

29

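More generally

$\mathrm{Cov}\left(\sum_{i=1}^{n} x_i,\ \sum_{j=1}^{m} y_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{m} \mathrm{Cov}(x_i, y_j)$

$\mathrm{Var}(x + y) = \mathrm{Var}(x) + \mathrm{Var}(y) + 2\,\mathrm{Cov}(x,y)$

Hence, the variance of a sum equals the sum of the variances ONLY when the elements are uncorrelated

Question: What is Var(x-y)?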

30

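Regressions

Consider the best (linear) predictor of y given we know x:

$\hat{y} = \bar{y} + b_{y|x}\,(x - \bar{x})$

The slope of this linear regression is a function of Cov:

$b_{y|x} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}$

The fraction of the variation in y accounted for by knowing x is r², so the residual variance about the regression, Var(ŷ - y), is Var(y)[1 - r²]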

31

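Relationship between the correlation and the regression slope:

$r(x,y) = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}} = b_{y|x}\sqrt{\frac{\mathrm{Var}(x)}{\mathrm{Var}(y)}}$

If Var(x) = Var(y), then $b_{y|x} = b_{x|y} = r(x,y)$

In this case, the fraction of variation accounted for by the regression is b²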

32

[Figure: scatterplots illustrating r² = 0.3, 0.6, 0.9, and 1.0]

33

Properties of Least-squares Regressions

The slope and intercept obtained by least-squares minimize the sum of squared residuals:

$\sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2$

• The average value of the residual is zero

• The LS solution maximizes the amount of variation in y that can be explained by a linear regression on x

• The residual errors around the least-squares regression are uncorrelated with the predictor variable x

• Fraction of variance in y accounted for by the regression is r²

• Homoscedastic vs. heteroscedastic residual variances
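The least-squares properties listed above can be verified on a small data set. The sketch below uses made-up (hypothetical) x and y values and the standard results that the least-squares slope is Cov(x,y)/Var(x) and that the residual variance is Var(y)(1 - r²); helper names are illustrative.

```python
def mean(v):
    return sum(v) / len(v)

def cov(xs, ys):
    """Population covariance E[(x - mean_x)(y - mean_y)]."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope = cov(xs, ys) / cov(xs, xs)            # b = Cov(x, y) / Var(x)
intercept = mean(ys) - slope * mean(xs)      # line passes through the means
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

r2 = cov(xs, ys) ** 2 / (cov(xs, xs) * cov(ys, ys))
resid_var = cov(residuals, residuals)

print(mean(residuals))                       # ~0: residuals average to zero
print(cov(xs, residuals))                    # ~0: residuals uncorrelated with x
print(resid_var, cov(ys, ys) * (1 - r2))     # equal: Var(y)(1 - r^2)
```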

4. Different methods of statistical analysis

34

35

Different methods of analysis
•  Parameters of these various models can be estimated in a number of frameworks
•  Method of moments
   –  Very few assumptions about the underlying distribution. Typically, the mean of some statistic has an expected value of the parameter
   –  Example: Estimate of the mean µ given by the sample mean, xbar, as E(xbar) = µ.
   –  While estimation does not require distribution assumptions, confidence intervals and hypothesis testing do
•  Distribution-based estimation
   –  The explicit form of the distribution is used

36

Distribution-based estimation
•  Maximum likelihood estimation
   –  MLE
   –  REML
   –  More in Lynch & Walsh (book) Appendix 3
•  Bayesian
   –  More in Walsh & Lynch (online chapters = Vol 2) Appendices 2, 3

37

Maximum Likelihood

p(x1,…, xn | θ) = density of the observed data (x1,…, xn) given the (unknown) distribution parameter(s) θ

Fisher suggested the method of maximum likelihood: given the data (x1,…, xn), find the value(s) of θ that maximize p(x1,…, xn | θ)

We usually express p(x1,…, xn | θ) as a likelihood function l(θ | x1,…, xn) to remind us that it is dependent on the observed data

The Maximum Likelihood Estimator (MLE) of θ is the value (or values) that maximizes the likelihood function l given the observed data x1,…, xn.

38

[Figure: likelihood curve l(θ | x) plotted against θ, with the MLE of θ at its peak]

The curvature of the likelihood surface in the neighborhood of the MLE informs us as to the precision of the estimator. A narrow peak = high precision. A broad peak = low precision

This is formalized by looking at the log-likelihood surface, L = ln[l(θ | x)]. Since ln is a monotonic function, the value of θ that maximizes l also maximizes L

The larger the curvature, the smaller the variance

39

Likelihood Ratio tests

Hypothesis testing in the ML framework occurs through likelihood-ratio (LR) tests

$LR = -2 \ln\left[\frac{l(\hat{\theta}_r \mid x)}{l(\hat{\theta} \mid x)}\right] = -2\left[L(\hat{\theta}_r \mid x) - L(\hat{\theta} \mid x)\right]$

$\hat{\theta}_r$ is the MLE under the restricted conditions (some parameters specified, e.g., var = 1)

$\hat{\theta}$ is the MLE under the unrestricted conditions (no parameters specified)

For large sample sizes (generally) LR approaches a Chi-square distribution with r df (r = number of parameters assigned fixed values under null)
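As an illustration of a likelihood-ratio test, the sketch below assumes a binomial model with hypothetical data (14 successes in 20 trials) and a null hypothesis that fixes p = 0.5, so r = 1. It uses the standard form LR = -2[L(restricted) - L(unrestricted)] and, for 1 df, the identity Pr(χ² > x) = erfc(√(x/2)). None of these numbers come from the notes.

```python
import math

def log_lik(p, k, n):
    """Binomial log-likelihood; the constant binomial coefficient is dropped
    because it cancels in the likelihood ratio."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

k, n = 14, 20        # hypothetical data: 14 successes in 20 trials
p_hat = k / n        # unrestricted MLE of p
p_null = 0.5         # value fixed under the null (one restricted parameter)

lr = -2 * (log_lik(p_null, k, n) - log_lik(p_hat, k, n))

# Chi-square survival function for 1 df: Pr(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr / 2))
print(round(lr, 3), round(p_value, 3))
```

With more than one restricted parameter, a general chi-square survival function would be needed in place of the 1-df shortcut.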

40

Bayesian Statistics

An extension of likelihood is Bayesian statistics

p(θ | x) = C * l(x | θ) p(θ)

Instead of simply estimating a point estimate (e.g., the MLE), the goal is to estimate the entire distribution for the unknown parameter θ given the data x

p(θ | x) is the posterior distribution for θ given the data x

l(x | θ) is just the likelihood function

p(θ) is the prior distribution on θ.

41

Bayesian Statistics

Why Bayesian?

• Exact for any sample size

• Marginal posteriors

• Efficient use of any prior information

• MCMC (such as Gibbs sampling) methods

Priors quantify the strength of any prior information. Often these are taken to be diffuse (with a high variance), so prior weights on θ spread over a wide range of possible values.

42

p values in Hypothesis testing
•  The p value of a test statistic is the probability of seeing a value as large (or larger) under the null hypothesis
•  For example, suppose you are assuming a random variable comes from a normal with mean zero and variance one.
   –  The probability of seeing a value more extreme than 2 (i.e., greater than 2 or less than -2) is 0.0455, the p value associated with this value of the test statistic.
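The two-sided p value for a unit normal can be computed from the complementary error function in Python's standard library. A minimal sketch (the function name is illustrative):

```python
import math

def two_sided_p(z):
    """Pr(|Z| >= |z|) for a unit normal Z (mean 0, variance 1)."""
    return math.erfc(abs(z) / math.sqrt(2))

print(round(two_sided_p(2), 4))  # 0.0455
```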

43

Significance and multiple comparisons
•  One could either report a p value or have some criterion (e.g., any test with a p value less than 0.01) that declares a test to be significant (and hence a positive result)
   –  p is the probability of a false positive, the probability of declaring a test under the null as being significant.
•  The problem of multiple comparisons arises when a large number of tests are performed.
   –  Suppose our significance threshold is p = 0.005, but 1000 tests are done. Under the null, we still expect 0.005*1000 = 5 significant tests
   –  Bonferroni corrections are done by first setting a significance level for the entire COLLECTION of tests (say π = 0.05). To have this level of experiment-wide control of false positives requires that each test uses p = π/n
      •  For n = 1000, an experiment-wide false positive rate (probability) of 0.05 declares significance only when the p value for a test is less than 0.05/1000 = 0.00005.

44

Power and Type I/II errors
•  A Type I error occurs when we declare a test to be significant when the null is true (a false positive)
•  The power of a statistical test (a function of the sample size and the true parameters) is the probability of declaring a test to be significant when the null is false.
   –  A Type II error occurs when we fail to declare a test significant when it is not from the null (i.e., a false negative)

45

FDR, the false discovery rate
•  p is the probability of declaring a test under the null to be significant (the false-positive rate)
•  When many tests are expected to be significant (e.g., looking for differences in expression over a large number of genes), a more appropriate measure is the false discovery rate (or FDR), the fraction of false positives among all tests declared to be significant.
   –  Example: Suppose 1000 tests with a significance threshold of p = 0.005 are used. Expect 5 false positives, but suppose that 30 significant tests are found. Here the FDR = 5/30 = 0.167.
   –  Hence, 16.7% of the positive tests are false positives
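The multiple-comparison arithmetic from the last few slides (expected false positives, the Bonferroni per-test threshold, and the FDR in the worked example) can be collected into a short sketch. Function names are illustrative, and the FDR function follows the slide's simple calculation, which assumes all tests are drawn from the null when counting expected false positives.

```python
def expected_false_positives(p_threshold, n_tests):
    """Expected number of significant tests if every test is under the null."""
    return p_threshold * n_tests

def bonferroni_threshold(experiment_wide, n_tests):
    """Per-test threshold giving experiment-wide control at the stated level."""
    return experiment_wide / n_tests

def false_discovery_rate(p_threshold, n_tests, n_significant):
    """Expected false positives as a fraction of tests declared significant."""
    return expected_false_positives(p_threshold, n_tests) / n_significant

print(expected_false_positives(0.005, 1000))              # about 5
print(bonferroni_threshold(0.05, 1000))                   # about 0.00005
print(round(false_discovery_rate(0.005, 1000, 30), 3))    # 0.167
```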

1

Lecture 4: Allelic Effects and Genetic Variances



2

Quantitative Genetics

The analysis of traits whose variation is determined by both a number of genes and environmental factors

Phenotype is highly uninformative as to underlying genotype

3

Complex (or Quantitative) trait
•  No (apparent) simple Mendelian basis for variation in the trait
•  May be a single gene strongly influenced by environmental factors
•  May be the result of a number of genes of equal (or differing) effect
•  Most likely, a combination of both multiple genes and environmental factors
•  Example: Blood pressure, cholesterol levels
   –  Known genetic and environmental risk factors
•  Molecular traits can also be quantitative traits
   –  mRNA level on a microarray analysis
   –  Protein spot volume on a 2-D gel

4

Phenotypic distribution of a trait

5

Consider a specific locus influencing the trait

For this locus, mean phenotype = 0.15, while overall mean phenotype = 0

Hence, it is very hard to distinguish the QQ individuals from all others simply from their phenotypic values

Values for QQ individuals shaded in dark green

6

Basic model of Quantitative Genetics

Basic model: P = G + E
•  P = phenotypic value (we will occasionally also use z for this value)
•  G = genotypic value
•  E = environmental value

G = average phenotypic value for that genotype if we are able to replicate it over the universe of environmental values, G = E[P]

Hence, genotypic values are functions of the environments experienced.

7

Basic model of Quantitative Genetics

Basic model: P = G + E

G = average phenotypic value for that genotype if we are able to replicate it over the universe of environmental values, G = E[P]

G = average value of an inbred line over a series of environments

G x E interaction: the performance of a particular genotype in a particular environment differs from the sum of the average performance of that genotype over all environments and the average performance of that environment over all genotypes. Basic model now becomes P = G + E + GE

8

East (1911) data on US maize crosses

9

Each sample (P1, P2, F1) has same G, all variation in P is due to variation in E

Same G, Var(P) = Var(E)

10

All same G, hence Var(P) = Var(E)

Variation in G Var(P) = Var(G) + Var(E)

Var(F2) > Var(F1) due to Variation in G

Johannsen (1903) bean data

•  Johannsen had a series of fully inbred (= pure) lines.

•  There was a consistent between-line difference in the mean bean size – Differences in G across lines

•  However, within a given line, size of parental seed was independent of size of offspring seed
   –  No variation in G within a line

11

12

13

Goals of Quantitative Genetics
•  Partition total trait variation into genetic (nature) vs. environmental (nurture) components
•  Predict resemblance between relatives
   –  If a sib has a disease/trait, what are your odds?
   –  Selection response
   –  Change in mean under inbreeding, outcrossing, assortative mating
•  Find the underlying loci contributing to genetic variation
   –  QTL -- quantitative trait loci
•  Deduce molecular basis for genetic trait variation
•  eQTLs -- expression QTLs, loci with a quantitative influence on gene expression
   –  e.g., QTLs influencing mRNA abundance on a microarray

14

The transmission of genotypes versus alleles

•  With fully inbred lines, offspring have the same genotype as their parent, and hence the entire parental genotypic value G is passed along
   –  Hence, favorable interactions between alleles (such as with dominance) are not lost by randomization under random mating but rather passed along.
•  When offspring are generated by crossing (or random mating), each parent contributes a single allele at each locus to its offspring, and hence only passes along a PART of its genotypic value
•  This part is determined by the average effect of the allele
   –  Downside is that favorable interactions between alleles are NOT passed along to their offspring in a diploid (but, as we will see, are in an autotetraploid)

15

Genotypic values

It will prove very useful to decompose the genotypic value into the difference between homozygotes (2a) and a measure of dominance (d or k = d/a)

Genotype   aa      Aa      AA
Value      C - a   C + d   C + a

Note that the constant C is the average value of the two homozygotes.

If no dominance, d = 0, as heterozygote value equals the average of the two parents. Can also write d = ka, so that G(Aa) = C + ak

16

Computing a and d

Suppose a major locus influences plant height, with the following values

Genotype      aa   Aa   AA
Trait value   10   15   16

C = [G(AA) + G(aa)]/2 = (16 + 10)/2 = 13
a = [G(AA) - G(aa)]/2 = (16 - 10)/2 = 3
d = G(Aa) - [G(AA) + G(aa)]/2 = G(Aa) - C = 15 - 13 = 2
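The same calculation as a short Python sketch (the function name `genotypic_parameters` is illustrative), which also returns the dominance ratio k = d/a defined earlier:

```python
def genotypic_parameters(g_aa, g_Aa, g_AA):
    """Return (C, a, d) from the three genotypic values at one locus."""
    c = (g_AA + g_aa) / 2    # midpoint of the two homozygotes
    a = (g_AA - g_aa) / 2    # half the difference between homozygotes
    d = g_Aa - c             # departure of the heterozygote from the midpoint
    return c, a, d

c, a, d = genotypic_parameters(10, 15, 16)
k = d / a
print(c, a, d, round(k, 3))  # 13.0 3.0 2.0 0.667
```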

17

Population means: Random mating

Let p = freq(A), q = 1 - p = freq(a). Assuming random mating (Hardy-Weinberg frequencies),

Genotype    aa      Aa      AA
Value       C - a   C + d   C + a
Frequency   q²      2pq     p²

Mean = q²(C - a) + 2pq(C + d) + p²(C + a)

µRM = C + a(p² - q²) + d(2pq) = C + a(p - q) + d(2pq), as p² - q² = (p - q)(p + q) = p - q

Here a(p² - q²) is the contribution from homozygotes and d(2pq) is the contribution from heterozygotes

18

Population means: Inbred cross F2

Suppose two inbred lines are crossed. If A is fixed in one line and a in the other, then p = q = 1/2

Genotype    aa      Aa      AA
Value       C - a   C + d   C + a
Frequency   1/4     1/2     1/4

Mean = (1/4)(C - a) + (1/2)(C + d) + (1/4)(C + a)

µRM = C + d/2

Note that C is the average of the two parental lines, so when d > 0, the F2 exceeds this (heterosis). Note also that the F1 exceeds this average by d, so only half of this is passed on to the F2.

19

Population means: RILs from an F2

A large number of F2 individuals are fully inbred, either by selfing for many generations or by generating doubled haploids. If p and q denote the F2 frequencies of A and a, what is the expected mean over the set of resulting RILs?

Genotype    aa      Aa      AA
Value       C - a   C + d   C + a
Frequency   q       0       p

µRILs = C + a(p - q)

Note this is independent of the amount of dominance (d)
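The three population means (random mating, F2, and RILs) all follow from weighting the genotypic values C - a, C + d, C + a by the appropriate genotype frequencies. The sketch below assumes the plant-height values C = 13, a = 3, d = 2 from the earlier worked example; function names are illustrative.

```python
def mean_from_frequencies(c, a, d, f_aa, f_Aa, f_AA):
    """Population mean given genotype frequencies for aa, Aa, AA."""
    return f_aa * (c - a) + f_Aa * (c + d) + f_AA * (c + a)

def mean_random_mating(c, a, d, p):
    """Hardy-Weinberg frequencies q^2, 2pq, p^2 with p = freq(A)."""
    q = 1 - p
    return mean_from_frequencies(c, a, d, q * q, 2 * p * q, p * p)

def mean_rils(c, a, d, p):
    """Fully inbred lines: only homozygotes, with frequencies q and p."""
    q = 1 - p
    return mean_from_frequencies(c, a, d, q, 0.0, p)

c, a, d = 13.0, 3.0, 2.0
print(mean_random_mating(c, a, d, 0.5))   # F2 mean, C + d/2 = 14.0
print(mean_rils(c, a, d, 0.5))            # RIL mean, C + a(p - q) = 13.0
```

Comparing the two printed values shows the point made on the RIL slide: inbreeding removes the heterozygote contribution, so d drops out of the mean.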

20

The average effect of an allele

•  The average effect αA of an allele A is defined by the difference between offspring that get that allele and a random offspring.
   –  αA = mean(offspring value given parent transmits A) - mean(all offspring)
   –  Similar definition for αa.
•  Note that while C, a and d (the genotypic parameters) do not change with allele frequency, αx is clearly a function of the frequencies of alleles with which allele x combines.

21

Random mating

Consider the average effect of allele A when a parent is randomly mated to another individual from its population

Suppose parent contributes A

Allele from other parent   Probability   Genotype   Value
A                          p             AA         C + a
a                          q             Aa         C + d

Mean(A transmitted) = p(C + a) + q(C + d) = C + pa + qd

αA = Mean(A transmitted) - µ = q[a + d(q - p)]

22

Random mating

Now suppose parent contributes a

Allele from other parent   Probability   Genotype   Value
A                          p             Aa         C + d
a                          q             aa         C - a

Mean(a transmitted) = p(C + d) + q(C - a) = C - qa + pd

αa = Mean(a transmitted) - µ = -p[a + d(q - p)]

23

α, the average effect of an allelic substitution

•  α = αA - αa is the average effect of an allelic substitution, the change in mean trait value when an a allele in a random individual is replaced by an A allele
   –  α = a + d(q - p). Note that
      •  αA = qα and αa = -pα.
      •  E(αX) = pαA + qαa = pqα - qpα = 0
      •  The average effect of a random allele is zero, hence average effects are deviations from the mean
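The average effects can be checked directly from their definition (mean of offspring receiving an allele minus the overall mean) against the closed forms αA = qα and αa = -pα. This is a minimal sketch with hypothetical values a = 3, d = 2, p = 0.3; C is set to zero because it cancels.

```python
def average_effects(a, d, p):
    """Average effects of A and a under random mating, from the definition."""
    q = 1 - p
    c = 0.0                                   # C cancels in the differences
    mu = c + a * (p - q) + 2 * p * q * d      # random-mating mean
    mean_A = p * (c + a) + q * (c + d)        # parent transmits A
    mean_a = p * (c + d) + q * (c - a)        # parent transmits a
    return mean_A - mu, mean_a - mu

a, d, p = 3.0, 2.0, 0.3                       # hypothetical values
q = 1 - p
alpha_A, alpha_a = average_effects(a, d, p)
alpha = a + d * (q - p)                       # effect of an allelic substitution

print(alpha_A, q * alpha)                     # equal up to rounding
print(alpha_a, -p * alpha)                    # equal up to rounding
print(p * alpha_A + q * alpha_a)              # ~0: mean average effect is zero
```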

24

Dominance deviations
•  Fisher (1918) decomposed the contribution to the genotypic value from a single locus as $G_{ij} = \mu + \alpha_i + \alpha_j + \delta_{ij}$
   –  Here, µ is the mean (a function of p)
   –  αi are the average effects
   –  Hence, µ + αi + αj is the predicted genotypic value given the average effect (over all genotypes) of alleles i and j.
   –  The dominance deviation associated with genotype Gij is the difference between its true value and its value predicted from the sum of average effects (essentially a residual)

25

Fisher’s (1918) Decomposition of G

One of Fisher’s key insights was that the genotypic value consists of a fraction that can be passed from parent to offspring and a fraction that cannot.

In particular, under sexual reproduction, parents only pass along SINGLE ALLELES to their offspring

Consider the genotypic value Gij resulting from an AiAj individual

$G_{ij} = \mu_G + \alpha_i + \alpha_j + \delta_{ij}$

•  Mean value: $\mu_G = \sum G_{ij}\,\mathrm{Freq}(A_iA_j)$
•  αi, αj: average contribution to genotypic value for alleles i and j

26

Since parents pass along single alleles to their offspring, the αi (the average effect of allele i) represent these contributions

The average effect for an allele is POPULATION-SPECIFIC, as it depends on the types and frequencies of alleles that it pairs with

The genotypic value predicted from the individual allelic effects is thus

$\hat{G}_{ij} = \mu_G + \alpha_i + \alpha_j$

27

The genotypic value predicted from the individual allelic effects is thus $\hat{G}_{ij} = \mu_G + \alpha_i + \alpha_j$

Dominance deviations: the difference (for genotype AiAj) between the genotypic value predicted from the two single alleles and the actual genotypic value,

$G_{ij} - \hat{G}_{ij} = \delta_{ij}$

28

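[Figure: genotypic value plotted against N = number of copies of allele 2 (N = 0, 1, 2 for genotypes 11, 21, 22). The regression line passes through the predicted values µ + 2α1, µ + α1 + α2, and µ + 2α2, with slope = α = α2 - α1. The dominance deviations δ11, δ12, and δ22 are the departures of the actual genotypic values G11, G21, and G22 from this line.]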

29

Average Effects and Additive Genetic Values

The α values are the average effects of an allele

A key concept is the Additive Genetic Value (A) of an individual

$A(G_{ij}) = \alpha_i + \alpha_j$ for a single locus

$A = \sum_{k=1}^{n} \left(\alpha_i^{(k)} + \alpha_j^{(k)}\right)$ summed over n loci

$\alpha_i^{(k)}$ = effect of allele i at locus k

A is called the Breeding value or the Additive genetic value

30

Why all the fuss over A?

Suppose pollen parent has A = 10 and seed parent has A = -2 for plant height

Expected average offspring height is (10-2)/2 = 4 units above the population mean. Offspring A = average of parental A’s

KEY: parents only pass single alleles to their offspring. Hence, they only pass along the A part of their genotypic value G

31

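Genetic Variances

Writing the genotypic value as

$G_{ij} = \mu_G + (\alpha_i + \alpha_j) + \delta_{ij}$

The genetic variance can be written as

$\sigma^2(G) = \sigma^2(\alpha_i + \alpha_j) + \sigma^2(\delta_{ij})$

This follows since

$\sigma^2(G) = \sigma^2\left(\mu_G + (\alpha_i + \alpha_j) + \delta_{ij}\right) = \sigma^2(\alpha_i + \alpha_j) + \sigma^2(\delta_{ij}) + 2\,\mathrm{Cov}(\alpha_i + \alpha_j, \delta_{ij})$

and the last term vanishes, as Cov(α,δ) = 0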

32

Genetic Variances

$\sigma^2_G = \sigma^2_A + \sigma^2_D$

•  $\sigma^2_A$: Additive Genetic Variance (or simply additive variance)
•  $\sigma^2_D$: Dominance Genetic Variance (or simply dominance variance)

Hence, total genetic variance = additive + dominance variances

33

Key concepts (so far)
•  αi = average effect of allele i
   –  Property of a single allele in a particular population (depends on genetic background)
•  A = Additive Genetic Value
   –  A = sum (over all loci) of average effects
   –  Fraction of G that parents pass along to their offspring
   –  Property of an individual in a particular population
•  Var(A) = additive genetic variance
   –  Variance in additive genetic values
   –  Property of a population
•  Can estimate A or Var(A) without knowing any of the underlying genetical detail (forthcoming)

34

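One locus, 2 alleles:

Genotype   Q1Q1   Q1Q2     Q2Q2
Value      0      a(1+k)   2a

Since E[α] = 0, Var(α) = E[(α - µα)²] = E[α²]

Additive variance, with pi = freq(Qi):

$\sigma^2_A = 2E[\alpha^2] = 2\sum_i p_i\,\alpha_i^2 = 2p_1p_2\,a^2\,[1 + k(p_1 - p_2)]^2$

When dominance is present, the additive variance is an asymmetric function of allele frequencies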

35

Dominance variance

Genotype   Q1Q1   Q1Q2     Q2Q2
Value      0      a(1+k)   2a

$\sigma^2_D = (2p_1p_2\,ak)^2$

This is a symmetric function of allele frequencies

Can also be expressed in terms of d = ak: $\sigma^2_D = (2p_1p_2\,d)^2$
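The split of the genetic variance into additive and dominance parts can be checked numerically for the one-locus, two-allele model with genotypic values 0, a(1+k), 2a. The sketch below assumes the standard one-locus expressions σ²_A = 2p1p2a²[1 + k(p1 - p2)]² and σ²_D = (2p1p2ak)², with p1 = freq(Q1) and Hardy-Weinberg genotype frequencies; the parameter values are hypothetical.

```python
def variance_components(a, k, p1):
    """Return (V_G, V_A, V_D) for genotypic values 0, a(1+k), 2a."""
    p2 = 1 - p1
    values = [0.0, a * (1 + k), 2 * a]            # Q1Q1, Q1Q2, Q2Q2
    freqs = [p1 * p1, 2 * p1 * p2, p2 * p2]       # Hardy-Weinberg frequencies
    mu = sum(f * g for f, g in zip(freqs, values))
    v_g = sum(f * (g - mu) ** 2 for f, g in zip(freqs, values))
    v_a = 2 * p1 * p2 * a**2 * (1 + k * (p1 - p2)) ** 2
    v_d = (2 * p1 * p2 * a * k) ** 2
    return v_g, v_a, v_d

v_g, v_a, v_d = variance_components(a=1.0, k=0.5, p1=0.3)
print(v_g, v_a + v_d)    # total genetic variance equals V_A + V_D

# With complete dominance (k = 1), V_A changes when p1 and p2 are swapped,
# while V_D does not
print(variance_components(1.0, 1.0, 0.3)[1:], variance_components(1.0, 1.0, 0.7)[1:])
```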

36

[Figure: additive variance, VA, plotted against allele frequency, p, with no dominance (k = 0)]

37

[Figure: VA and VD plotted against allele frequency, p, with complete dominance (k = 1)]

38

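Epistasis

With interactions between loci, the genotypic value can be written as G = µG + A + D + AA + AD + DD + …

These components are defined to be uncorrelated (or orthogonal), so that

$\sigma^2_G = \sigma^2_A + \sigma^2_D + \sigma^2_{AA} + \sigma^2_{AD} + \sigma^2_{DD} + \cdots$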

39

Additive x Additive interactions (αα, AA): interactions between a single allele at one locus with a single allele at another

Additive x Dominance interactions (αδ, AD): interactions between an allele at one locus with the genotype at another, e.g. allele Ai and genotype Bkj

Dominance x Dominance interactions (δδ, DD): the interaction between the dominance deviation at one locus with the dominance deviation at another.

40

Effects and Variance when using a testor

•  A common design in plant breeding is to cross members from a population to a testor to generate a testcross.
   –  Testor can be either an inbred or an outcrossing population
   –  Often from a different heterotic group from the population being tested
   –  Often testor is an elite genotype
•  The average effect of an allele in a testcross, its variance, and its additive (General combining ability, GCA) and interaction (Specific combining ability, SCA) effects all follow in analogous fashion to previous results for crosses within a population

41

The average effect of an allele in a testcross
•  The concept of the average effect of an allele when crossed within its population is easily extended to the average effect of an allele when crossed to a testor.
   –  Called the testcross average effect.
•  The average effect of allele x in this testcross, $\alpha_x^T$, is defined as the difference between the mean value of offspring getting this allele from the population versus the mean value of a random offspring from this cross
   –  Will turn out to be a function of the frequencies of alleles in both the tested and the testor population.

42

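Mean value for a testcross

Suppose the frequency of A is p in the population and pT in the testor (with q and qT similarly defined for a).

Each cell gives the frequency and genotypic value of the offspring class (rows: allele from the parental line; columns: allele from the testor)

                       Testor A (pT)    Testor a (qT)
Parental line A (p)    ppT, C + a       pqT, C + d
Parental line a (q)    qpT, C + d       qqT, C - a

Mean of cross = C + a(ppT - qqT) + d(pqT + qpT)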

43

Average testcross mean in a series of RILs

•  Slide 19 gave an expression for the expected average performance from a series of RILs formed by crossing two populations.
•  A similar expression exists for the average testcross performance for a series of RILs from a cross of A x B
   –  Mean = $(1/2)\,\mu_A^T + (1/2)\,\mu_B^T$, namely the average of the testcross means for A and B
   –  More generally (since lines can, by chance, have unequal contributions of alleles),
   –  Mean = $\pi_A\,\mu_A^T + \pi_B\,\mu_B^T$, where πA = (1 - πB) is the fraction of alleles from A in the sample of RILs
   –  Can use molecular markers to estimate the πx directly.

44

$\alpha_A^T$, testcross effect of allele A

Suppose parent contributes A

Allele from testor   Probability   Genotype   Value
A                    pT            AA         C + a
a                    qT            Aa         C + d

Mean(A transmitted) = pT(C + a) + qT(C + d) = C + pTa + qTd

$\alpha_A^T$ = Mean(A transmitted) - µ = q[a + d(qT - pT)]

Likewise,

$\alpha_a^T$ = Mean(a transmitted) - µ = -p[a + d(qT - pT)]

45

$\alpha^T$, the average testcross effect of an allelic substitution
•  $\alpha^T = \alpha_A^T - \alpha_a^T$ is the average testcross effect of an allelic substitution, the change in mean trait value when an a allele in a random testcrossed individual is replaced by an A allele
   –  $\alpha^T$ = a + d(qT - pT). Note that this is independent of the allele frequencies in the parental population, and depends ONLY on the testor allele frequencies.
•  $\alpha_A^T = q\,\alpha^T$, $\alpha_a^T = -p\,\alpha^T$, and $E(\alpha_x^T) = 0$

46

Testcross variance
•  Just as the additive genetic variance was the population variance in the sum of the average effects of an allele, the testcross variance is the variance in the average testcross effects of a random allele
   –  $\mathrm{Var}(A^T) = \mathrm{Var}(\alpha_x^T)$
   –  $\mathrm{Var}(\alpha_x^T) = p\,(\alpha_A^T)^2 + q\,(\alpha_a^T)^2$
   –  = p(q[a + d(qT - pT)])² + q(-p[a + d(qT - pT)])²
   –  = pq[a + d(qT - pT)]²
   –  Hence, $\mathrm{Var}(\alpha_x^T)$ = pq[a + d(qT - pT)]²

47

GCA and SCA
•  Consider a cross between individuals from population 1 and population 2
•  Let $\mu_{1 \times 2}$ denote the average value for all of these crosses, and let Gij be the average genotypic value of an individual from a cross between individual (or line) i from population 1 and individual (or line) j from population 2.
•  Analogous to Fisher’s decomposition, we can write this in terms of two additive effects and one interaction effect.

48

$G_{ij} = \mu_{1 \times 2} + \alpha_i^{2} + \alpha_j^{1} + \delta_{ij}^{12}$

$\alpha_i^2$ is the testcross average effect for allele i (more generally an allele from individual i) when tested using population 2 as a testor, with $\alpha_j^1$ similarly defined for allele j (from pop 2) using population 1 as the testor

$\delta_{ij}^{12}$ is the interaction between allele i from population 1 and allele j from population 2 in the testcross of 1 and 2

The sum over all loci of the $\alpha_i^2$ values is the general combining ability (GCA) of line i when crossed to population 2 (note these are cross-specific)!

The sum of the δ is the specific combining ability (SCA)

49

$G_{ij} = \mu + GCA_i^{2} + GCA_j^{1} + SCA_{ij}^{12}$

The superscripts denoting the population in which the allele is being tested are often suppressed

The GCA is akin to the breeding value from one parent, but now it is the testcross value of that parent

The predicted mean of a particular cross is the sum of the two GCAs for those individuals/lines

As with average effects and dominance deviations, these are only defined with respect to a particular reference set of crosses (i.e., lines from Pop 1 x lines from Pop 2)

50

Within-population crosses vs. testors

                               Within-pop         Testor
Allelic effects                α                  αT
Additive transmitting factor   Breeding value A   GCA
Predicting offspring mean      A1/2 + A2/2        GCA1 + GCA2
Nonadditive component          Dominance value    SCA
Genetic variances              Var(A), Var(D)     Var(GCA), Var(SCA)

1

Lecture 5: Resemblance Between Relatives

Bruce Walsh Notes

Introduction to Plant Quantitative Genetics, Tucson, 8-10 Jan 2018

2

Heritability
•  Central concept in quantitative genetics
•  Fraction of phenotypic variance due to additive genetic values (Breeding values)
   –  h² = VA/VP
   –  This is called the narrow-sense heritability
   –  Phenotypes (and hence VP) can be directly measured
   –  Breeding values (and hence VA) must be estimated
•  Estimates of VA require known collections of relatives

3

Broad-sense heritability

•  Narrow-sense heritability h² applies when outcrossing,
   –  h² = Var(A)/Var(P)
   –  = the fraction of all trait variation due to variation in breeding (additive genetic) values
•  Broad-sense heritability H² applies when selecting among a series of pure lines
   –  H² = Var(G)/Var(P)
   –  = the fraction of all trait variation due to variation in genotypic values

4

Defining H² for Plant Populations

Plant breeders often do not measure individual plants (especially with pure lines), but instead measure a plot or a block of individuals.

This replication can result in inconsistent measures of H² even for otherwise identical populations.

Let $z_{ijkl}$ denote the value of the l-th replicate in plot k of genotype i in environment j. We can decompose this value as

$z_{ijkl} = G_i + E_j + GE_{ij} + p_{ijk} + e_{ijkl}$

where $p_{ijk}$ is the effect of the k-th plot and $e_{ijkl}$ is the deviation of individual plants within this plot

5

Suppose we replicate the genotype over e environments, with r plots (replicates) per environment, and n individuals per plot.

If we set our unit of measurement as the average over all plots, the phenotypic variance for the mean of line i becomes

$\sigma^2(\bar{z}_i) = \sigma^2_G + \sigma^2_E + \frac{\sigma^2_{GE}}{e} + \frac{\sigma^2_p}{e\,r} + \frac{\sigma^2_e}{e\,r\,n}$

Thus, VP, and H² = VG/VP, depend on our choice of e, r, and n

In order to compare broad-sense heritabilities we need to use a consistent design (same values of e, r, and n)
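To see how the design changes H², the variance of a line mean can be evaluated for different values of e, r, and n. The sketch below assumes the form of the line-mean variance given on this slide (genotype-by-environment variance divided by e, plot variance by e·r, and within-plot variance by e·r·n); the variance components themselves are hypothetical numbers chosen for illustration.

```python
def line_mean_variance(v_g, v_env, v_ge, v_plot, v_within, e, r, n):
    """Phenotypic variance of a line mean over e environments,
    r plots per environment, and n individuals per plot."""
    return v_g + v_env + v_ge / e + v_plot / (e * r) + v_within / (e * r * n)

def broad_sense_h2(v_g, v_env, v_ge, v_plot, v_within, e, r, n):
    """H^2 = V_G / V_P for the chosen design."""
    return v_g / line_mean_variance(v_g, v_env, v_ge, v_plot, v_within, e, r, n)

# Hypothetical variance components
components = dict(v_g=1.0, v_env=0.0, v_ge=2.0, v_plot=4.0, v_within=8.0)

h2_single = broad_sense_h2(**components, e=1, r=1, n=1)      # one plant
h2_replicated = broad_sense_h2(**components, e=2, r=2, n=4)  # replicated design
print(round(h2_single, 3), round(h2_replicated, 3))
```

The same population gives very different H² values under the two designs, which is why comparisons require identical e, r, and n.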

6

Key observations
•  The amount of phenotypic resemblance among relatives for the trait provides an indication of the amount of genetic variation for the trait.
•  If trait variation has a significant genetic basis, the closer the relatives, the more similar their appearance
•  The covariance between the phenotypic values of relatives measures the strength of this similarity, with larger Cov = more similarity

7

Genetic Covariance between relatives

Genetic covariances arise because two related individuals are more likely to share alleles than are two unrelated individuals.

Sharing alleles means having alleles that are identical by descent (IBD): both copies can be traced back to a single copy in a recent common ancestor.

[Figure: alleles passed from Father and Mother to two offspring, illustrating the three cases: no alleles IBD, one allele IBD, both alleles IBD.]

9

ANOVA: Analysis of variance
•  Partitioning of trait variance into within- and among-group components
•  Two key ANOVA identities
   –  Total variance = between-group variance + within-group variance
      •  Var(T) = Var(B) + Var(W)
   –  Variance(between groups) = covariance(within groups)
   –  Intraclass correlation, t = Var(B)/Var(T)
•  The more similar individuals are within a group (higher within-group covariance), the larger their between-group differences (variance in the group means)

10

[Figure: trait values for four groups (1-4) under two situations.]

Situation 1: Var(B) = 2.5, Var(W) = 0.2, Var(T) = 2.7, so t = 2.5/2.7 = 0.93

Situation 2: Var(B) = 0, Var(W) = 2.7, Var(T) = 2.7, so t = 0
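As a quick check of the two situations, the intraclass correlation can be computed directly from the between- and within-group variances. This is a minimal sketch; the function name is illustrative.

```python
def intraclass_correlation(var_between, var_within):
    """t = Var(B) / Var(T), with Var(T) = Var(B) + Var(W)."""
    return var_between / (var_between + var_within)


t_situation_1 = intraclass_correlation(2.5, 0.2)
t_situation_2 = intraclass_correlation(0.0, 2.7)
print(round(t_situation_1, 2), t_situation_2)
```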

11

Why cov(within) = variance(among)?
•  Let zij denote the jth member of group i.
   –  Here zij = u + gi + eij
   –  gi is the group effect
   –  eij is the residual error
•  Covariance within a group, Cov(zij, zik)
   –  = Cov(u + gi + eij, u + gi + eik)
   –  = Cov(gi, gi), as all other terms are uncorrelated
   –  Cov(gi, gi) = Var(g) is the among-group variance

12

Resemblance between relatives and variance components

•  The phenotypic covariance between relatives can be expressed in terms of genetic variance components
   –  Cov(zx, zy) = axyVA + bxyVD
   –  The weights a and b depend on the nature of the relatives x and y, and are measures of how often they are expected to share alleles identical by descent
   –  These are critical in predicting selection response

13

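Parent-offspring genetic covariance

Cov(Gp, Go): parents and offspring share EXACTLY one allele IBD.

Denote this common (IBD) allele by A1, and the non-IBD alleles by Ax (in the parent) and Ay (in the offspring):

$$G_p = A_p + D_p = \alpha_1 + \alpha_x + D_{1x}$$

$$G_o = A_o + D_o = \alpha_1 + \alpha_y + D_{1y}$$

All cross terms are uncorrelated except the shared allele, so Cov(Gp, Go) = Cov(α1, α1) = Var(α1), and since Var(A) = 2Var(α), this equals Var(A)/2.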

15

Hence, relatives sharing one allele IBD have a genetic covariance of Var(A)/2

The resulting parent-offspring genetic covariance becomes Cov(Gp,Go) = Var(A)/2

16

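Half-sibs

[Figure: a common father mated to two different mothers (1 and 2), producing offspring o1 and o2.]

Each sib gets exactly one allele from the common father, and different alleles from the different mothers.

•  The half-sibs share one allele IBD with probability 1/2
•  The half-sibs share no alleles IBD with probability 1/2

Hence, the genetic covariance of half-sibs is just (1/2)Var(A)/2 = Var(A)/4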

17

Full-sibs

[Figure: Father and Mother each passing one allele to Sib 1 and to Sib 2.]

Each sib gets exactly one allele from each parent.

Prob(allele from father IBD) = 1/2. Given the paternal allele received by sib 1, there is probability 1/2 that sib 2 gets the same allele.

Prob(allele from father not IBD) = 1/2. Given the paternal allele received by sib 1, there is probability 1/2 that sib 2 gets the other allele.

18

Full-sibs (cont)

•  Paternal allele not IBD [Prob = 1/2] and maternal allele not IBD [Prob = 1/2]:
   Prob(sibs share 0 alleles IBD) = 1/2 * 1/2 = 1/4
•  Paternal allele IBD [Prob = 1/2] and maternal allele IBD [Prob = 1/2]:
   Prob(sibs share 2 alleles IBD) = 1/2 * 1/2 = 1/4
•  Prob(share 1 allele IBD) = 1 - Pr(0) - Pr(2) = 1/2

20

Resulting genetic covariance between full-sibs

IBD alleles   Probability   Contribution
0             1/4           0
1             1/2           Var(A)/2
2             1/4           Var(A) + Var(D)

Cov(Full-sibs) = (1/2)[Var(A)/2] + (1/4)[Var(A) + Var(D)] = Var(A)/2 + Var(D)/4
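The same weighting can be written out for any pair of relatives once the probabilities of sharing 0, 1 or 2 alleles IBD are known. The sketch below uses exact fractions; the function name is illustrative, and the three probability sets are the ones given on these slides for full sibs, half sibs and parent-offspring pairs.

```python
from fractions import Fraction as F


def covariance_coefficients(p0, p1, p2):
    """Return (weight on Var(A), weight on Var(D)) from IBD-sharing probabilities.

    Sharing one allele IBD contributes Var(A)/2; sharing two contributes
    Var(A) + Var(D); sharing none contributes nothing.
    """
    if p0 + p1 + p2 != 1:
        raise ValueError("IBD probabilities must sum to 1")
    weight_a = p1 * F(1, 2) + p2
    weight_d = p2
    return weight_a, weight_d


full_sibs = covariance_coefficients(F(1, 4), F(1, 2), F(1, 4))
half_sibs = covariance_coefficients(F(1, 2), F(1, 2), F(0))
parent_offspring = covariance_coefficients(F(0), F(1), F(0))
print(full_sibs, half_sibs, parent_offspring)
```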

21

Genetic Covariances for General Relatives

Let r = (1/2)Prob(1 allele IBD) + Prob(2 alleles IBD)

Let u = Prob(both alleles IBD)

General genetic covariance between relatives Cov(G) = rVar(A) + uVar(D)

When epistasis is present, additional terms appear:

$$r^2\,\mathrm{Var}(AA) + ru\,\mathrm{Var}(AD) + u^2\,\mathrm{Var}(DD) + r^3\,\mathrm{Var}(AAA) + \cdots$$

22

More general relationships

•  To obtain the expected covariance for any set of relatives, we normally need only compute r and u for that set of relatives

•  With general inbreeding, the covariance becomes more complex, as three other terms arise in addition to VA and VD (not discussed here; see WL Chapter 11 for details)

•  With crosses involving inbred and/or related parents, values for r and u are different from those presented above.

23

Coefficients of Coancestry

Suppose we pick a single allele each at random from two relatives. The probability that these are IBD is called Θ, the coefficient of coancestry.

Θxy denotes the coefficient for relatives x and y.

Consider an offspring z from a (hypothetical) cross of x and y. Θxy = fz, the inbreeding coefficient of z. Why? Because z gets one randomly chosen allele from each parent. The probability fz that both alleles in z are IBD (the probability of inbreeding) is thus just Θxy.

24

θ and the coefficient on VA
•  The coefficient on the additive variance for the relatives x and y is just 2θxy.
•  To see this,
   –  let AiAj denote the two alleles in x and AkAl those in y.
   –  Cov(breeding values) = Pr(Ai IBD Ak)cov(αi, αk) + Pr(Ai IBD Al)cov(αi, αl) + Pr(Aj IBD Ak)cov(αj, αk) + Pr(Aj IBD Al)cov(αj, αl) = 4θxyVar(α)
   –  Since Var(A) = 2Var(α), Cov = 2θxyVar(A)

25

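Θxx: The Coancestry of an individual with itself

Self x: what is the inbreeding coefficient of its offspring?

To compute Θxx, denote the two alleles in x by A1 and A2, and draw one allele at random twice (with replacement). Each of the four combinations has probability 1/4:

First draw   Second draw   Prob(IBD)
A1           A1            1
A1           A2            fx
A2           A1            fx
A2           A2            1

Here fx = Prob(A1 and A2 are IBD), which is zero when x is not inbred.

Hence, for a non-inbred individual, Θxx = 2/4 = 1/2

If x is inbred, Θxx = (2 + 2fx)/4 = (1 + fx)/2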

26

Example

Consider the following pedigree.

[Figure: pedigree in which A and B are the parents of E, C and D are the parents of F, and E and F are the parents of G.]

Suppose A and D are fully inbred, and related, lines with θAD = 0.5. Further, B and C are unrelated and outcrossed individuals.

Individual           A    B     C     D
Fx                   1    0     0     1
θxx = (1 + Fx)/2     1    1/2   1/2   1

27

The Parent-offspring Coancestry

Let A1, An denote the two alleles in the offspring, where A1 is the allele received from the focal parent (P) and An is the allele from the nonfocal parent (NP), while A1, Ap are the two alleles in the focal parent.

Draw from offspring   Draw from parent   Prob(IBD)
A1                    A1                 1
A1                    Ap                 fp (A1, Ap are IBD if the parent is inbred)
An                    A1                 ΘP,NP
An                    Ap                 ΘP,NP

Here ΘP,NP is the probability that the alleles from the two parents are IBD, i.e., that the offspring is inbred (ΘP,NP = fo).

For a non-inbred individual, ΘPO = 1/4

General: ΘPO = (1 + fp + 2ΘP,NP)/4 = (1 + fp + 2fo)/4

28

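[Figure: pedigree in which A and B are the parents of E, C and D are the parents of F, and E and F are the parents of G.]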

From before

θAA= θDD = 1; θBB = θCC = 1/2; θAD = 1/2, θAB = θAC = θBC = θBD = θCD = 0

Consider A - E (inbred parent - offspring) θAE = (1+fA)/4 = (1+1)/4 = 1/2. Same value for θDF

Consider B - E (outbred parent - offspring) θBE = (1+fB)/4 = (1+0)/4 = 1/4. Same value for θCF

Consider E - G (outbred parent - offspring) θEG = (1+fE+2θEF)/4 = (1+0+2[1/8])/4 = 5/16. Same value for θFG

29

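[Figure: pedigree in which A and B are the parents of E, C and D are the parents of F, and E and F are the parents of G.]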

From before

θAA= θDD = 1; θBB = θCC = 1/2; θAD = 1/2, θAB = θAC = θBC = θBD = θCD = 0

What about θEF ?

The randomly-chosen allele from E has equal chance of being from A or B. Likewise for F (from C or D)

Of these four possible combinations (A&C, A&D, B&C, B&D), only an allele from A and an allele from D have a chance of being IBD, which is θAD = 1/2.

Hence, θEF = θAD /4 = 1/8

30

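Full sibs (x and y) from parents m and f

Unrelated, non-inbred parents: the alleles drawn from x and from y both trace to m with probability (1/2)(1/2), and two alleles drawn from m are IBD with probability 1/2. The same holds for f, so each parent contributes (1/2)(1/2)(1/2) = 1/8:

Θ = 1/8 + 1/8 = 1/4

Unrelated, inbred parents: two alleles drawn from m are IBD with probability (1 + fm)/2, and two drawn from f with probability (1 + ff)/2, so the two paths contribute [(1 + fm)/2](1/2)(1/2) and [(1 + ff)/2](1/2)(1/2):

Θ = (2 + fm + ff)/8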

31

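Parents inbred & related: there are two additional paths to add to Θ = (2 + fm + ff)/8. The allele drawn from x may trace to m while the allele drawn from y traces to f, or the reverse. Each of these paths has probability (1/2)(1/2) and the two alleles are then IBD with probability Θmf, so each path contributes Θmf/4.

This gives Θ = (2 + fm + ff + 4Θmf)/8 = (2 + fm + ff + 4fo)/8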

32


Putting all this together gives

Θxy = (2 + fm + ff + 4Θmf)/8

[Figure: full sibs x and y with parents m and f; m has sire sm and dam dm, and f has sire sf and dam df.]

Since ff = Θsf,df and fm = Θsm,dm,

Θxy = (2 + Θsm,dm + Θsf,df + 4Θmf)/8

33

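Example

[Figure: pedigree in which A and B are the parents of E, C and D are the parents of F, and E and F are the parents of the full sibs S1 and S2.]

From before

θAA= θDD = 1; θBB = θCC = 1/2; θAD = 1/2, θEF = 1/8, θAB = θAC = θBC = θBD = θCD = 0

For the full sibs S1 and S2, whose parents are E and F,

θS1S2 = (2 + ΘAB + ΘCD + 4ΘEF)/8 = (2 + 0 + 0 + 4[1/8])/8 = (4 + 1)/16 = 5/16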
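The coancestry values in this pedigree example can be reproduced with exact fractions. This is a minimal sketch under the slide's assumptions (θAD = 1/2, all other founder pairs unrelated); the helper names are illustrative.

```python
from fractions import Fraction as F


def coancestry_of_two_crosses(theta_pairs):
    """Coancestry of two individuals with no shared parent.

    theta_pairs holds the four coancestries between a parent of the
    first individual and a parent of the second; each pairing is
    drawn with probability 1/4.
    """
    return sum(theta_pairs) / 4


def full_sib_coancestry(f_m, f_f, theta_mf):
    """Theta_xy = (2 + f_m + f_f + 4*Theta_mf) / 8 for full sibs x and y."""
    return (2 + f_m + f_f + 4 * theta_mf) / 8


# Founder coancestries from the example: only A and D are related.
theta_AC, theta_AD, theta_BC, theta_BD = F(0), F(1, 2), F(0), F(0)
theta_AB, theta_CD = F(0), F(0)

theta_EF = coancestry_of_two_crosses([theta_AC, theta_AD, theta_BC, theta_BD])

# Inbreeding of E is theta_AB and of F is theta_CD (both zero here).
theta_S1S2 = full_sib_coancestry(theta_AB, theta_CD, theta_EF)
print(theta_EF, theta_S1S2)
```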

34

Half-sibs

[Figure: A is the common parent; A and B are the parents of E, and A and C are the parents of F.]

•  Using the same arguments as above,
   θEF = (θAA + θAB + θAC + θBC)/4 = ([1 + fA]/2 + θAB + θAC + θBC)/4
•  Hence, if A, B and C are unrelated to one another, θEF = (1 + fA)/8

35

Computing θxy -- The Recursive Method
•  There is a simple recursive method for generating the elements Aij = 2θij of a relationship matrix (used for BLUP selection). For ease of reading, we use the notation A(i,j) = Aij
   –  The basic idea is that the founding individuals of the pedigree are assumed to be unrelated and not inbred (although this can also be accommodated). These founders are assigned values of A(i,i) = 1.
   –  Likewise, any unknown parent of any future individual is assumed to be unrelated to all others in the pedigree and not inbred, and is also assigned a value of A(i,i) = 1.
   –  Let Si and Di denote the sire and dam (father and mother) of individual i. For this offspring, A(i,i) = 1 + A(Si, Di)/2
   –  A(i,j) = A(j,i) = [A(j,Si) + A(j,Di)]/2 = [A(i,Sj) + A(i,Dj)]/2
   –  The recursive (or tabular) method starts with the founding parents and then proceeds down the pedigree in a recursive fashion to fill out A for the desired pedigree.

36

Example

[Figure: pedigree of individuals 1-11. Individuals 3, 4, 5 and 8 each have an unknown parent (only a single arrow leads to them).]

Ancestors are 1 & 2: A(1,1) = A(2,2) = 1, A(1,2) = 0

3: S3 = 1, D3 = unknown.
   A(3,3) = 1 + A(S3,D3)/2 = 1 + A(1,unk)/2 = 1
   A(1,3) = [A(1,S3) + A(1,D3)]/2 = [A(1,1) + A(1,unk)]/2 = 1/2
   Note also that A(1,4) = A(1,5) = 1/2, A(4,4) = A(5,5) = 1.
   A(3,4) = [A(3,S4) + A(3,D4)]/2 = [A(3,1) + A(3,unk)]/2 = (1/2 + 0)/2 = 1/4. Same for A(3,5) = 1/4.
   2 is unrelated to 3, 4, 5, giving A(2,3) = A(2,4) = A(2,5) = 0.

37

[Figure: the same pedigree of individuals 1-11.]

6: S6 = 2, D6 = 3.
   A(6,6) = 1 + A(S6, D6)/2 = 1 + A(2,3)/2 = 1
   A(6,1) = [A(1, S6) + A(1, D6)]/2 = [A(1,2) + A(1,3)]/2 = [0 + 1/2]/2 = 1/4
   A(6,2) = [A(2, S6) + A(2, D6)]/2 = [A(2,2) + A(2,3)]/2 = [1 + 0]/2 = 1/2
   A(6,3) = [A(3, S6) + A(3, D6)]/2 = [A(3,2) + A(3,3)]/2 = [0 + 1]/2 = 1/2
   A(6,4) = [A(4, S6) + A(4, D6)]/2 = [A(4,2) + A(4,3)]/2 = [0 + 1/4]/2 = 1/8
   A(6,5) = [A(5, S6) + A(5, D6)]/2 = [A(5,2) + A(5,3)]/2 = (0 + 1/4)/2 = 1/8

7: S7 = 2, D7 = 4.
   A(7,7) = 1 + A(S7, D7)/2 = 1 + A(2,4)/2 = 1 + 0/2 = 1
   A(6,7) = [A(6, S7) + A(6, D7)]/2 = [A(6,2) + A(6,4)]/2 = (1/2 + 1/8)/2 = 5/16

8: S8 = 5, D8 = unk.
   A(8,8) = 1 + A(S8, D8)/2 = 1 + A(5,unk)/2 = 1
   A(6,8) = [A(6, S8) + A(6, D8)]/2 = [A(6,5) + A(6,unk)]/2 = (1/8)/2 = 1/16

9: S9 = 7, D9 = 6.
   A(9,9) = 1 + A(S9, D9)/2 = 1 + A(6,7)/2 = 1 + 5/32 = 1.156 <- inbred!
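The tabular method is easy to sketch in code. The version below is an illustration rather than a production implementation: it assumes individuals are listed with parents before offspring, treats an unknown parent (None) as unrelated to everyone, and uses exact fractions. The pedigree dictionary encodes only individuals 1-9, whose parents are stated in the worked example; the function name is illustrative.

```python
from fractions import Fraction as F

# individual -> (sire, dam); None marks an unknown parent
pedigree = {
    1: (None, None), 2: (None, None),
    3: (1, None), 4: (1, None), 5: (1, None),
    6: (2, 3), 7: (2, 4), 8: (5, None), 9: (7, 6),
}


def relationship_matrix(ped):
    """Fill A(i,j) = 2*theta_ij recursively, parents before offspring."""
    A = {}

    def a(i, j):
        # Any pairing that involves an unknown parent contributes zero.
        if i is None or j is None:
            return F(0)
        return A[(i, j)]

    ids = list(ped)
    for pos, i in enumerate(ids):
        sire, dam = ped[i]
        A[(i, i)] = 1 + a(sire, dam) / 2
        for j in ids[:pos]:
            A[(i, j)] = A[(j, i)] = (a(j, sire) + a(j, dam)) / 2
    return A


A = relationship_matrix(pedigree)
print(A[(6, 7)], A[(6, 8)], A[(9, 9)], float(A[(9, 9)]))
```

The diagonal element A(9,9) exceeds 1 by the inbreeding coefficient of individual 9, matching the value obtained by hand.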

Actual relatedness versus expected values from pedigrees

Values for the coefficient of coancestry (θ) and the coefficient of fraternity (Δ) obtained from pedigrees are expected values. Due to random segregation of genes from parents, the actual value (or realization) can be different. For example, we expect 2θ to be 1/2 for full sibs. However, one pair of sibs may actually be more similar (say 0.6) and another less similar (say 0.35). On average, 2θ is 1/2 for pairs of full sibs, but if we knew the actual value of θ for a given pair, we would have more information. With sufficiently dense genetic markers, we can estimate these relationships directly.

Genomic selection uses this extra information.

What about coefficient of coancestry θ ?

39 39

40

Locus:              1     2     3     4     5     6     7     8     9     10
Indiv x:            00    00    10    10    00    10    11    00    11    00
Indiv y:            10    00    11    11    10    11    11    10    11    10
Locus-specific θ:   0.5   1.0   0.5   0.5   0.5   0.5   1.0   0.5   1.0   0.5

Estimated θ is the average over all ten loci, = 0.65
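The locus-specific values in this example are consistent with scoring, at each locus, the probability that an allele drawn at random from x matches an allele drawn at random from y. The sketch below assumes that scoring rule (the slide does not state it explicitly) and reproduces the ten values and their average; the function name is illustrative.

```python
# Two-allele genotypes at ten marker loci, copied from the example
indiv_x = ["00", "00", "10", "10", "00", "10", "11", "00", "11", "00"]
indiv_y = ["10", "00", "11", "11", "10", "11", "11", "10", "11", "10"]


def locus_theta(geno_x, geno_y):
    """Share of the allele pairs (one from x, one from y) that match."""
    matches = sum(ax == ay for ax in geno_x for ay in geno_y)
    return matches / (len(geno_x) * len(geno_y))


locus_values = [locus_theta(gx, gy) for gx, gy in zip(indiv_x, indiv_y)]
theta_hat = sum(locus_values) / len(locus_values)
print(locus_values, theta_hat)
```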

41

The coefficient of fraternity
•  While (twice) the coefficient of coancestry gives the weight on the additive variance for two relatives, a related measure of IBD status among relatives gives the weight on the dominance variance
•  The probability that the two alleles in individual x are IBD to the two alleles in individual y is denoted Δxy, and is called the coefficient of fraternity.
•  This can be expressed as a function of the coefficients of coancestry for the parents (mx and fx) of x and the parents (my and fy) of y.
   –  Δxy = θmxmyθfxfy + θmxfyθfxmy

42


43

The coefficient of fraternity (cont)

•  x and y can have both alleles IBD if
   –  The allele from the father (fx) of x and the allele from the father (fy) of y are IBD (probability θfxfy) AND the allele from the mother (mx) of x and the allele from the mother (my) of y are IBD (probability θmxmy), giving θfxfyθmxmy
   –  OR the allele from the mother (mx) of x and the allele from the father (fy) of y are IBD (probability θmxfy) AND the allele from the father (fx) of x and the allele from the mother (my) of y are IBD (probability θfxmy), giving θmxfyθfxmy
   –  Putting these together gives
      •  Δxy = θmxmyθfxfy + θmxfyθfxmy

44

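Δxy, The Coefficient of Fraternity

Δxy = Prob(both alleles in x & y IBD)

[Figure: x with parents mx and fx, and y with parents my and fy; the four paths between the parents are labelled θmxmy, θfxfy, θmxfy and θfxmy.]

Δxy = θmxmyθfxfy + θmxfyθfxmy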

45

Examples of Δxy: Full sibs
•  Full sibs share the same mom and dad
   –  mx = my = m, fx = fy = f
   –  Δxy = θmxmyθfxfy + θmxfyθfxmy = θmmθff + θmf²
   –  Δxy = (1+fm)(1+ff)/4 + θmf²
•  If parents unrelated, θfm = 0, giving
   –  Δxy = (1+fm)(1+ff)/4
•  If parents are unrelated and not inbred,
   –  Δxy = 1/4

46

Examples of Δxy: Half sibs
•  Paternal half sibs share the same dad, different moms
   –  fx = fy = f; mx and my
   –  Δxy = θmxmyθfxfy + θmxfyθfxmy = θmxmyθff + θmxfθmyf
   –  Δxy = θmxmy(1+ff)/2 + θmxfθmyf
•  If mothers are unrelated to each other and to the common father, θmxmy = θmxf = θmyf = 0, giving
   –  Δxy = 0

47

When is Δ non-zero?
•  Since Δxy = θmxmyθfxfy + θmxfyθfxmy
•  A nonzero value for Δ requires either
   –  That the fathers of x and y are related AND the mothers of x and y are related
   –  OR that the father of x is related to the mother of y AND the mother of x is related to the father of y

48

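[Figure: pedigree in which A and B are the parents of E, C and D are the parents of F, and E and F are the parents of the full sibs S1 and S2.]

From before

θAA= θDD = 1; θBB = θCC = 1/2; θAD = 1/2, θEF = 1/8, θAB = θAC = θBC = θBD = θCD = 0

What is Δ for the full sibs (S1 and S2)?

Δxy = θmxmyθfxfy + θmxfyθfxmy = θEEθFF + θEF²

E and F are not inbred (fE = θAB = 0, fF = θCD = 0), so θEE = θFF = 1/2, giving

Δxy = (1/2)(1/2) + (1/8)² = 1/4 + 1/64 = 17/64 = 0.266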
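The fraternity calculation can be checked with exact fractions. This is a small sketch using the values from the example (parents E and F not inbred, θEF = 1/8); the function names are illustrative.

```python
from fractions import Fraction as F


def self_coancestry(f):
    """theta_xx = (1 + f_x) / 2."""
    return (1 + f) / 2


def fraternity(theta_mxmy, theta_fxfy, theta_mxfy, theta_fxmy):
    """Delta_xy = theta_mxmy * theta_fxfy + theta_mxfy * theta_fxmy."""
    return theta_mxmy * theta_fxfy + theta_mxfy * theta_fxmy


theta_EE = self_coancestry(F(0))  # E is not inbred
theta_FF = self_coancestry(F(0))  # F is not inbred
theta_EF = F(1, 8)

# Full sibs S1, S2 share both parents E and F.
delta_full_sibs = fraternity(theta_EE, theta_FF, theta_EF, theta_EF)

# Half sibs through a non-inbred father, with unrelated mothers.
delta_half_sibs = fraternity(F(0), self_coancestry(F(0)), F(0), F(0))
print(delta_full_sibs, float(delta_full_sibs), delta_half_sibs)
```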

49

Δxy and the coefficient on VD

•  The coefficient on the dominance variance for the relatives x and y is just Δxy.
•  To see this,
   –  let AiAj denote the two alleles in x and AkAl those in y.
   –  Suppose that alleles i and k come from the mothers of these two relatives and alleles j and l from their fathers.
   –  Cov(dominance values) = Pr(Ai IBD Ak; Aj IBD Al)cov(δij, δkl) + Pr(Ai IBD Al; Aj IBD Ak)cov(δij, δkl)
   –  = (θfxfyθmxmy + θmxfyθfxmy)Var(D) = Δxy Var(D)

50

General Resemblance between relatives

51

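Example

[Figure: pedigree in which A and B are the parents of E, C and D are the parents of F, and E and F are the parents of the full sibs S1 and S2.]

We found for full sibs S1, S2 that θ = 5/16, hence 2θ = 5/8; Δ = 17/64

The expected genetic covariance between these sibs is

$$\tfrac{5}{8}\mathrm{Var}(A) + \tfrac{17}{64}\mathrm{Var}(D) + \left(\tfrac{5}{8}\right)^2\mathrm{Var}(AA) + \tfrac{5}{8}\cdot\tfrac{17}{64}\,\mathrm{Var}(AD) + \left(\tfrac{17}{64}\right)^2\mathrm{Var}(DD) + \cdots$$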
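The weights on each variance component follow mechanically from θ and Δ. The sketch below evaluates them up to the two-locus epistatic terms, using r = 2θ and u = Δ; the function name and dictionary keys are illustrative.

```python
from fractions import Fraction as F


def covariance_weights(theta, delta):
    """Weights on Var(A), Var(D), Var(AA), Var(AD), Var(DD) for two relatives."""
    r = 2 * theta
    u = delta
    return {"A": r, "D": u, "AA": r ** 2, "AD": r * u, "DD": u ** 2}


weights = covariance_weights(F(5, 16), F(17, 64))
for component, weight in weights.items():
    print(component, weight)
```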

52

Covariance among selfed lines
•  A common situation in plant breeding is that two inbred lines are crossed
   –  The frequency of any segregating allele in the F1/F2 is 1/2
   –  Starting with the F1, a series of lines is formed by selfing, S_k = F_{2+k}
   –  The covariance among the various Sk lines is important in the response to selection in selfed lines
   –  In particular, we want the covariance between an Sj and an Sk line whose last common ancestor was the Si, with i < j < k

53

Autotetraploids
•  Peanut, potato, alfalfa, and soybeans are all examples of crops with at least some autotetraploid lines
•  With autotetraploids, there are four alleles per locus, with a parent passing along two alleles to an offspring
•  As a result, a parent can pass along the dominance contribution in G to an offspring
•  Further, there are now four variance components associated with each locus

54

Genetic variances for autotetraploids

•  G = A + D + T + Q
   –  A (additive) and D (dominance, or digenic effects) as with diploids
   –  T (trigenic effects) are the three-way interactions among alleles at a locus
   –  Q (quadrigenic effects) are the four-way interactions at a locus
•  Total genetic variance becomes
   –  VG = VA + VD + VT + VQ

55

Resemblance between autotetraploid relatives

Relatives           VA     VD     VT     VQ
Half-sibs           1/4    1/36
Full-sibs           1/2    2/9    1/12   1/36
Parent-offspring    1/2    1/6

Assumes unrelated, non-inbred parents

.

Module 1 Lecture 6 1

Lecture 6 Heritability and Field Design

Lucia Gutierrez lecture notes

Tucson Plant Breeding Institute

January 2018 Tucson, Arizona

1

Module 1: Lecture 6

Selection Response

μ0 = mean of the initial Random Mating (R.M.) population
μS = mean of the selected individuals from a R.M. population
μ1 = mean of the progeny of the selected individuals
c = truncation point
S = selection differential (ΔX), S = μS − μ0
R = response to selection (ΔY), R = μ1 − μ0

$$\Delta Y = b_{y,x}\,\Delta X, \qquad R = b_{y,x}\,S, \qquad b_{y,x} = \frac{R}{S} = h^2$$

Heritability: “Expected proportion of selection differential to be achieved as a gain from selection” (Hanson, 1963).

Holland et al. 2010

.


Heritability

REALIZED HERITABILITY: The relation between the observed response to selection and the selection differential.

$$\hat{h}^2 = \frac{\hat{R}}{S}$$

NARROW SENSE HERITABILITY: The fraction of all trait variation due to variation in breeding values (additive variance).

$$h^2 = \frac{V_A}{V_P}$$

BROAD SENSE HERITABILITY: The fraction of all trait variation due to variation in genotypic values (genetic variance).

$$H^2 = \frac{V_G}{V_P}$$

Holland et al. 2010

Heritability

ADDITIVE (OR GENETIC) VARIANCE: We can estimate it from groups of related individuals:
* Covariance Parent – Offspring: ½ VA
* Covariance Half-sibs: ¼ VA
* Covariance Full-sibs: ½ VA + ¼ VD

TOTAL PHENOTYPIC VARIANCE: VP = VG + VE. Evaluate with proper designs.

$$h^2 = \frac{V_A}{V_P} \qquad\qquad H^2 = \frac{V_G}{V_P}$$

4


.


Heritability

FUNCTION OF POPULATION

o Heritability is a function of both genetic and environmental variance, therefore a property of the population.

o Consider a single inbred line carrying a trait with Mendelian inheritance. How large is the genetic variance? Will the trait segregate? What is h2? Since there is no genetic variation within this population, h2 = H2 = 0. However, this does not mean that the trait is not genetically determined.

o Additionally, heritability is an unreliable predictor of future response to selection, because the population changes while selection is being conducted.

5


Heritability

ENVIRONMENTAL VARIANCE

Since h2 is also a function of environmental variance, and decreasing environmental variance increases h2, controlled conditions would be optimal for identifying superior genotypes (predicting breeding values). However, caution should be exercised because GxE is important for many traits and therefore selecting in a non-targeted environment could be detrimental.

6


.


Parent-Offspring Regression

$$z_{oi} = \mu + b_{o|p}\,(z_{pi} - \mu) + e_i$$

z_oi = phenotype of the i-th offspring
μ = population mean
b_o|p = regression slope (offspring on parent)
z_pi = phenotype of the parent of the i-th offspring
e_i = deviations

LW 17

[Figure: scatter plot of Offspring (y-axis, −20 to 20) against Parent (x-axis, −20 to 20).]

7

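Parent-Offspring Regression

Regression of offspring on one parent (one offspring or the mean of multiple offspring):

$$E(b_{o|p}) = \frac{\sigma(z_o, z_p)}{\sigma^2(z_p)} \cong \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_{AA} + \sigma(E_o, E_p)}{\sigma^2_z}$$

Regression of offspring on one parent, no environmental correlation between parent and offspring:

$$E(b_{o|p}) \cong \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_{AA}}{\sigma^2_z} \cong \tfrac{1}{2}h^2, \qquad \hat{h}^2 \cong 2\,b_{o|p}$$

Regression of offspring on the midparent, no environmental correlation between parents and offspring:

$$E(b_{o|\bar{p}}) = \frac{\sigma\!\left(z_o, \tfrac{1}{2}[z_{P1} + z_{P2}]\right)}{\sigma^2\!\left(\tfrac{1}{2}[z_{P1} + z_{P2}]\right)} \cong \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_{AA}}{\tfrac{1}{2}\sigma^2_z} \cong h^2, \qquad \hat{h}^2 \cong b_{o|\bar{p}}$$

Regression of parent and offspring under inbreeding (S0 parents and their S0:1 progeny), no environmental correlation:

[Equation not recoverable from the source; the expected slope involves the covariance between S0 and S0:1 values, which contains dominance terms in addition to the additive variance. See LW 17.]

LW 17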

.


Mating Designs

FULL-SIB DESIGN: N full-sib families with n offspring each.

$$z_{ij} = \mu + f_i + w_{ij}$$

z_ij = phenotype of the j-th offspring of the i-th family
μ = population mean
f_i = effect of the i-th family
w_ij = residual error (segregation, dominance, environmental contribution); within-family variance

SoV               df        SS     MS           EMS
Among-families    N-1       SSf    SSf/df(f)    σ²w(FS) + n σ²f
Within-families   N(n-1)    SSw    SSw/df(w)    σ²w(FS)

$$SS_f = n\sum_i (\bar{z}_{i.} - \bar{z}_{..})^2, \qquad SS_w = \sum_{i,j} (z_{ij} - \bar{z}_{i.})^2$$

LW 18

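Mating Designs

$$z_{ij} = \mu + f_i + w_{ij}$$

The covariance between full sibs (two offspring j and j' of the same family i) is

$$\mathrm{Cov}(FS) = \sigma(z_{ij}, z_{ij'}) = \sigma\big[(\mu + f_i + w_{ij}),\,(\mu + f_i + w_{ij'})\big]$$

$$= \sigma(f_i, f_i) + \sigma(f_i, w_{ij'}) + \sigma(w_{ij}, f_i) + \sigma(w_{ij}, w_{ij'}) = \sigma^2_f$$

$$\sigma^2_f = \tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}$$

$$\sigma^2_P = \sigma^2_f + \sigma^2_{w(FS)}$$

$$\sigma^2_{w(FS)} = \sigma^2_P - \left(\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}\right)$$

$$= \left(\sigma^2_A + \sigma^2_D + \sigma^2_E\right) - \left(\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}\right)$$

$$= \tfrac{1}{2}\sigma^2_A + \tfrac{3}{4}\sigma^2_D + \sigma^2_E - \sigma^2_{Ec}$$

LW 18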

.


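Mating Designs

$$z_{ij} = \mu + f_i + w_{ij}$$

SoV               EMS
Among-families    σ²w(FS) + n σ²f
Within-families   σ²w(FS)

Variance component estimates:

$$\mathrm{Var}(f) = \frac{MS_f - MS_w}{n}, \qquad \mathrm{Var}(w) = MS_w, \qquad \mathrm{Var}(z) = \mathrm{Var}(f) + \mathrm{Var}(w)$$

Intraclass correlation of full sibs:

$$t_{FS} = \frac{\mathrm{Var}(f)}{\mathrm{Var}(z)} = \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}}{\sigma^2_z} = \tfrac{1}{2}h^2 + \frac{\tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}}{\sigma^2_z}$$

$$h^2 \cong 2\,t_{FS}$$

$$SE(h^2) \cong 2\,(1 - t_{FS})\,[1 + (n-1)\,t_{FS}]\,\sqrt{\frac{2}{N n (n-1)}}$$

LW 18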
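The full-sib estimators can be collected into one small function. This is a sketch for balanced data, using the approximate large-sample standard error for twice the intraclass correlation; the function name and the mean squares in the usage lines are hypothetical, not taken from any data set.

```python
import math


def full_sib_estimates(ms_among, ms_within, N, n):
    """Variance components, t_FS and h2 from a balanced full-sib ANOVA.

    N = number of families, n = offspring per family.
    """
    var_f = (ms_among - ms_within) / n
    var_w = ms_within
    var_z = var_f + var_w
    t_fs = var_f / var_z
    h2 = 2 * t_fs  # upper bound if dominance or common environment is present
    se_h2 = 2 * (1 - t_fs) * (1 + (n - 1) * t_fs) * math.sqrt(2 / (N * n * (n - 1)))
    return {"var_f": var_f, "var_w": var_w, "t_FS": t_fs, "h2": h2, "SE_h2": se_h2}


# Hypothetical mean squares: 20 families of 5 offspring each
estimates = full_sib_estimates(ms_among=30.0, ms_within=10.0, N=20, n=5)
print(estimates)
```

Because the family variance also contains a quarter of the dominance variance and any common-environment variance, 2t_FS is best read as an upper bound on h2 rather than a clean estimate.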


.


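Mating Designs

HALF-SIB DESIGN: N half-sib families with n offspring each.

$$z_{ij} = \mu + f_i + w_{ij}$$

z_ij = phenotype of the j-th offspring of the i-th family
μ = population mean
f_i = effect of the i-th family
w_ij = residual error (within-family deviation)

SoV               df        SS     MS           EMS
Among-families    N-1       SSf    SSf/df(f)    σ²w(HS) + n σ²f
Within-families   N(n-1)    SSw    SSw/df(w)    σ²w(HS)

$$SS_f = n\sum_i (\bar{z}_{i.} - \bar{z}_{..})^2, \qquad SS_w = \sum_{i,j} (z_{ij} - \bar{z}_{i.})^2$$

$$\mathrm{Var}(f) = \frac{MS_f - MS_w}{n}, \qquad \mathrm{Var}(w) = MS_w, \qquad \mathrm{Var}(z) = \mathrm{Var}(f) + \mathrm{Var}(w)$$

$$t_{HS} = \frac{\mathrm{Var}(f)}{\mathrm{Var}(z)} = \frac{\tfrac{1}{4}\sigma^2_A}{\sigma^2_z} = \tfrac{1}{4}h^2, \qquad h^2 \cong 4\,t_{HS}$$

LW 18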

Mating Designs

NORTH CAROLINA DESIGN I: Each male (N sires) is mated to several unrelated females (M dams) to produce n offspring per dam.

$$z_{ijk} = \mu + s_i + d_{j(i)} + w_{ijk}$$

z_ijk = phenotype of the k-th offspring from the family of the i-th sire and j-th dam
μ = population mean
s_i = effect of the i-th sire
d_j(i) = effect of the j-th dam mated to the i-th sire
w_ijk = residual error (within-family variance deviations)

LW 18

.


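Mating Designs

NORTH CAROLINA DESIGN I: Each male (N sires) is mated to several unrelated females (M dams) to produce n offspring per dam.

$$z_{ijk} = \mu + s_i + d_{j(i)} + w_{ijk}$$

SoV           df        SS     MS           EMS
Sires         N-1       SSs    SSs/df(s)    σ²w + n σ²d + Mn σ²s
Dams(Sire)    N(M-1)    SSd    SSd/df(d)    σ²w + n σ²d
Sibs(dams)    T-NM      SSw    SSw/df(w)    σ²w

$$SS_s = Mn\sum_i (\bar{z}_{i..} - \bar{z}_{...})^2, \qquad SS_d = n\sum_{i,j} (\bar{z}_{ij.} - \bar{z}_{i..})^2, \qquad SS_w = \sum_{i,j,k} (z_{ijk} - \bar{z}_{ij.})^2$$

$$\mathrm{Var}(s) = \frac{MS_s - MS_d}{Mn}, \qquad \mathrm{Var}(d) = \frac{MS_d - MS_w}{n}, \qquad \mathrm{Var}(w) = MS_w$$

$$\mathrm{Var}(z) = \mathrm{Var}(s) + \mathrm{Var}(d) + \mathrm{Var}(w)$$

$$t_{PHS} = \frac{\mathrm{Var}(s)}{\mathrm{Var}(z)} = \frac{\tfrac{1}{4}\sigma^2_A}{\sigma^2_z} = \tfrac{1}{4}h^2, \qquad h^2 \cong 4\,t_{PHS}$$

$$t_{FS} = \frac{\mathrm{Var}(s) + \mathrm{Var}(d)}{\mathrm{Var}(z)} = \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}}{\sigma^2_z} = \tfrac{1}{2}h^2 + \frac{\tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}}{\sigma^2_z}$$

LW 18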

Mating Designs

NORTH CAROLINA DESIGN II: A group of sires (Ns sires) is mated to an independent group of dams (Nd dams) to produce n offspring per cross.

$$z_{ijk} = \mu + s_i + d_j + I_{ij} + w_{ijk}$$

z_ijk = phenotype of the k-th offspring from the family of the i-th sire and j-th dam
μ = population mean
s_i = effect of the i-th sire
d_j = effect of the j-th dam
I_ij = effect of the interaction between the i-th sire and the j-th dam
w_ijk = residual error

LW 20

.


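Mating Designs

NORTH CAROLINA DESIGN II: A group of sires (Ns sires) is mated to an independent group of dams (Nd dams) to produce n offspring per cross.

$$z_{ijk} = \mu + s_i + d_j + I_{ij} + w_{ijk}$$

SoV            df              SS     EMS
Sires          Ns-1            SSs    σ²w + n σ²I + n Nd σ²s
Dams           Nd-1            SSd    σ²w + n σ²I + n Ns σ²d
Interaction    (Ns-1)(Nd-1)    SSI    σ²w + n σ²I
Sibs           NsNd(n-1)       SSw    σ²w

$$SS_s = nN_d\sum_i (\bar{z}_{i..} - \bar{z}_{...})^2, \qquad SS_d = nN_s\sum_j (\bar{z}_{.j.} - \bar{z}_{...})^2$$

$$SS_I = n\sum_{i,j} (\bar{z}_{ij.} - \bar{z}_{i..} - \bar{z}_{.j.} + \bar{z}_{...})^2, \qquad SS_w = \sum_{i,j,k} (z_{ijk} - \bar{z}_{ij.})^2$$

Intraclass correlations (paternal half sibs, maternal half sibs, interaction):

$$t_{PHS} = \frac{\tfrac{1}{4}\sigma^2_A}{\sigma^2_z}, \qquad t_{MHS} = \frac{\tfrac{1}{4}\sigma^2_A + \sigma^2_{Gm} + \sigma^2_{Ec}}{\sigma^2_z}, \qquad t_I = \frac{\tfrac{1}{4}\sigma^2_D}{\sigma^2_z}$$

LW 20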

Mating Designs

DIALLELS: A group of individuals (N) is mated to the same set of individuals (N) to produce n offspring per cross.

[Figure: three crossing schemes. Full diallel (all selfed and reciprocal crosses are made); incomplete diallel with no selfed crosses; incomplete diallel with no selfed and no reciprocal crosses.]

$$z_{ijk} = \mu + g_i + g_j + s_{ij} + w_{ijk}$$

z_ijk = phenotype of the k-th offspring from the i-th and j-th parents
μ = population mean
g_i = general combining ability of the i-th parent
g_j = general combining ability of the j-th parent
s_ij = specific combining ability of the i-th and j-th parents
w_ijk = residual error

LW 20

.


Mating Designs

DIALLELS: A group of individuals (N) is mated to the same set of individuals (N) to produce n offspring per cross. Analysis for the incomplete diallel without selfed or reciprocal crosses.

$$z_{ijk} = \mu + g_i + g_j + s_{ij} + w_{ijk}$$

SoV     df                     SS       EMS
GCA     N-1                    SSGCA    σ²w + n σ²SCA + n(N-2) σ²GCA
SCA     N(N-3)/2               SSSCA    σ²w + n σ²SCA
Sibs    (n-1)[N(N-1)/2-1]      SSw      σ²w

$$SS_{GCA} = \frac{n(N-1)^2}{N-2}\sum_i (\bar{z}_{i..} - \bar{z}_{...})^2$$

$$SS_{SCA} = n\sum_{i<j} (\bar{z}_{ij.} - \bar{z}_{...})^2 - SS_{GCA}$$

$$SS_w = \sum_{i<j,\,k} (z_{ijk} - \bar{z}_{ij.})^2$$

$$t_{GCA} = \frac{\tfrac{1}{4}\sigma^2_A}{\sigma^2_z}, \qquad t_{SCA} = \frac{\tfrac{1}{4}\sigma^2_D}{\sigma^2_z}$$

LW 20

Genotypic Means

GENOTYPIC MEANS:

$$z_{ijkl} = G_i + E_j + GE_{ij} + p_{ijk} + \varepsilon_{ijkl}$$

B 8

The environment includes non-genetic factors that affect the phenotype. It usually has a large influence on quantitative traits, affecting heritability and response to selection.

Micro-environment. The environment of a single plant. It needs to be controlled with experimental design.

Macro-environment. The environment associated with a location and time. GxE is the norm and not the exception in plants. Therefore defining the target environments is a crucial part of plant breeding, both for variance component estimation and for identifying superior genotypes.

.


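Micro-environment variation

HERITABILITY: For related individuals, heritability can be calculated as previously shown. The previous calculation assumes individual plants are measured, and heritability on an individual plant basis is estimated. However, because quantitative traits measured on individual plants have large non-genetic effects, heritability on a mean basis is higher.

In the expressions below, σ²F is the family variance, σ²FE the family-by-environment variance, σ²e the plot (error) variance and σ²w the within-plot variance; F_P is an inbreeding coefficient.

Individual plant basis, n=1, r=1, e=1

$$\sigma^2_P = \sigma^2_F + \sigma^2_{FE} + \sigma^2_e + \sigma^2_w$$

$$\hat{h}^2_p = \frac{4}{1 + F_P}\,\frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_P} = \frac{4}{1 + F_P}\,\frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_F + \hat{\sigma}^2_{FE} + \hat{\sigma}^2_e + \hat{\sigma}^2_w}$$

Plot basis, n=n, r=1, e=1

$$\sigma^2_P = \sigma^2_F + \sigma^2_{FE} + \sigma^2_e + \frac{\sigma^2_w}{n}$$

$$\hat{h}^2_F = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_P} = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_F + \hat{\sigma}^2_{FE} + \hat{\sigma}^2_e + \hat{\sigma}^2_w / n}$$

Plot basis, multiple env, n=n, r=r, e=e

$$\sigma^2_P = \sigma^2_F + \frac{\sigma^2_{FE}}{e} + \frac{\sigma^2_e}{re} + \frac{\sigma^2_w}{ern}$$

$$\hat{h}^2_F = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_P} = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_F + \hat{\sigma}^2_{FE}/e + \hat{\sigma}^2_e/(re) + \hat{\sigma}^2_w/(ern)}$$

Holland et al., 2010; B 6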

Experimental Design and Analysis

So we need good experimental designs to test genotypes!

FISHER’S PRINCIPLES. Statistical theory for efficient estimation (i.e. unbiased and of minimum variance) of treatment mean differences is based on 3 principles:

– Randomization: random assignment of treatments to experimental units provides valid estimation of experimental error and unbiased comparisons of treatments, and justifies statistical inference.

– Replication allows estimation of the experimental error variance, and decreases the variance of means (i.e. variance of a mean = variance of the observation divided by the number of replications).

– Local control is the grouping of homogeneous experimental units. Well-chosen blocks will decrease error variance.

22

Module 1: Lecture 6

.


Experimental Design and Analysis

CLASSIFICATION OF EXPERIMENTAL DESIGNS:

1. Experimental Units.
   1. Homogeneous – Completely Randomized Design (CRD)
   2. Heterogeneous in one way – Randomized Complete Block Design (RCBD)
   3. Heterogeneous in more than one way – Latin squares or latinized designs.
2. Large number of treatments.
   1. Incomplete Block Designs (IBD or Alpha)
   2. Unreplicated experiments (or Federer)
3. Modeling (post-blocking, spatial analysis)

[Figure: example field layouts for CRD, RCBD and IBD with four treatments.]

CRD:   $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
RCBD:  $y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$
IBD:   $y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{k(j)} + \varepsilon_{ijk}$

Completely Randomized Design (CRD)

TREATMENT ASSIGNMENT: Each treatment is assigned at random to the experimental units.
Experimental unit: one plot.

RICE EXAMPLE. YIELD COMPARISON OF 4 RICE CULTIVARS
Treatments: 4 cultivars
Replications: 4 homogeneous experimental plots per treatment
Experimental design: CRD
Dependent variable: Y = Grain yield (kg ha-1)

[Figure: field of 16 plots (numbered 1-16) with the four cultivars assigned at random, four plots each.]

EXPERIMENTAL PRINCIPLES
1. Randomization
2. Replication
3. Local control

Randomize in R (design.crd is provided by the agricolae package), where trt is a vector of all treatments (genotypes) and r is the number of reps:
> library(agricolae)
> design.crd(trt, r)
> design.crd(c("G1", "G2", "G3", "G4"), 4)

.



$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$

Y_ij = response of the i-th treatment on the j-th rep
μ = population mean
α_i = effect of the i-th treatment
ε_ij = experimental error (residual)

ASSUMPTIONS:

1. TO THE MODEL:
   • Is correct (in relation to the experimental units)
   • Is additive
2. EXPERIMENTAL ERRORS (for hypothesis testing):
   • Are random variables
   • εij ~ N
   • E(εij) = 0 for every i, j
   • V(εij) = σ2 for every i, j
   • Are independent
3. “BY DEFINITION” αi = μi - μ

Analyze in R:
> anova(lm(Y ~ trt, data=my.data))


Rep     T1    T2    T3    T4
1       47    50    57    54
2       52    54    53    65
3       62    67    69    74
4       51    57    57    59
Mean    53    57    59    63     (overall mean = 58)

SoV      df        SS       MS                F
Geno     (t-1)     SST      SST/df(T)         MST/MSE
Error    t(r-1)    SSE      SSE/df(E)
Total    tr-1      SSTOT    SSTOT/df(TOT)

$$SS_T = r\sum_i (\bar{y}_{i.} - \bar{y}_{..})^2, \qquad SS_E = \sum_{i=1}^{t}\sum_{j=1}^{r} (y_{ij} - \bar{y}_{i.})^2, \qquad SS_{TOT} = \sum_{i,j} (y_{ij} - \bar{y}_{..})^2$$

SoV      df    SS     MS      F       p
Geno     3     208    69.3    1.29    0.323
Error    12    646    53.8
Total    15    854
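The sums of squares for the rice example can be reproduced by hand. The sketch below is a plain one-way ANOVA for a balanced CRD written without any statistics library, so it stops at the F ratio and does not compute the p-value; the function and variable names are illustrative.

```python
# Yields for four cultivars with four reps each, copied from the example
yields = {
    "T1": [47, 52, 62, 51],
    "T2": [50, 54, 67, 57],
    "T3": [57, 53, 69, 57],
    "T4": [54, 65, 74, 59],
}


def crd_anova(data):
    """One-way ANOVA for a balanced completely randomized design."""
    t = len(data)                        # number of treatments
    r = len(next(iter(data.values())))   # reps per treatment
    observations = [y for ys in data.values() for y in ys]
    grand_mean = sum(observations) / len(observations)
    trt_means = {k: sum(v) / len(v) for k, v in data.items()}

    ss_trt = r * sum((m - grand_mean) ** 2 for m in trt_means.values())
    ss_err = sum((y - trt_means[k]) ** 2 for k, v in data.items() for y in v)
    ms_trt = ss_trt / (t - 1)
    ms_err = ss_err / (t * (r - 1))
    return {"SS_trt": ss_trt, "SS_err": ss_err, "SS_total": ss_trt + ss_err,
            "MS_trt": ms_trt, "MS_err": ms_err, "F": ms_trt / ms_err}


crd_table = crd_anova(yields)
print(crd_table)
```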

26

Module 1: Lecture 6

.


Randomized Complete Block Design (RCBD)

TREATMENT ASSIGNMENT: Each treatment is assigned at random to the experimental units within a block. Randomization in each block is independent.
Experimental unit: one plot.

WHEAT EXAMPLE. PLANT HEIGHT OF 5 ADVANCED LINES OF WHEAT IN 5 BLOCKS
Treatment: 5 advanced wheat lines
Block: 5 different blocks
Experimental design: RCBD
Dependent variable: Y = plant height (cm)

[Figure: field layout with five blocks (B1-B5), each containing the five lines in a separate random order.]

EXPERIMENTAL PRINCIPLES
1. Randomization
2. Replication
3. Local control

Randomize in R (design.rcbd is provided by the agricolae package), where trt is a vector of all treatments (genotypes) and r is the number of blocks:
> library(agricolae)
> design.rcbd(trt, r)
> design.rcbd(c("G1", "G2", "G3", "G4", "G5"), 5)


$$Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$$

Y_ij = response of the i-th treatment on the j-th block
μ = population mean
α_i = effect of the i-th treatment
β_j = effect of the j-th block
ε_ij = experimental error (residual)

ASSUMPTIONS:

1. TO THE MODEL:
   • Is correct (in relation to the experimental units)
   • Is additive
   • There is NO treatment by block interaction
2. EXPERIMENTAL ERRORS (for hypothesis testing):
   • Are random variables
   • εij ~ N
   • E(εij) = 0 for every i, j
   • V(εij) = σ2 for every i, j
   • Are independent
3. “BY DEFINITION” αi = μi - μ

Analyze in R:
> anova(lm(Y ~ trt + block, data=my.data))

.



Block    T1     T2     T3     T4     T5     Mean
B1       8      10     8      9      10     9
B2       7      9      8      8      9      8.2
B3       6      8      9      8      8      7.8
B4       6      7      9      8      7      7.4
B5       7      9      10     7      9      8.4
Mean     6.8    8.6    8.8    8.0    8.6    8.16

SoV      df            SS       MS           F
Block    (r-1)         SSB      SSB/df(B)
Geno     (t-1)         SST      SST/df(T)    MST/MSE
Error    (t-1)(r-1)    SSE      SSE/df(E)
Total    tr-1          SSTOT

$$SS_B = t\sum_j (\bar{y}_{.j} - \bar{y}_{..})^2, \qquad SS_T = r\sum_i (\bar{y}_{i.} - \bar{y}_{..})^2$$

$$SS_E = \sum_{i=1}^{t}\sum_{j=1}^{r} (y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})^2, \qquad SS_{TOT} = \sum_{i,j} (y_{ij} - \bar{y}_{..})^2$$

SoV      df    SS       MS       F       p
Block    4     7.36     1.84     2.77    0.0637
Geno     4     13.36    3.34     5.02    0.0081
Error    16    10.64    0.665
Total    24    31.36
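The wheat example can be checked the same way. The sketch below computes the RCBD sums of squares directly from the block-by-treatment table, obtaining the error sum of squares by subtraction; as above it stops at the F ratios, and the names are illustrative.

```python
# Plant heights: one row per block (B1-B5), one column per line (T1-T5)
heights = [
    [8, 10, 8, 9, 10],
    [7, 9, 8, 8, 9],
    [6, 8, 9, 8, 8],
    [6, 7, 9, 8, 7],
    [7, 9, 10, 7, 9],
]


def rcbd_anova(rows):
    """ANOVA for a randomized complete block design (one value per cell)."""
    r = len(rows)       # number of blocks
    t = len(rows[0])    # number of treatments
    grand_mean = sum(sum(row) for row in rows) / (r * t)
    block_means = [sum(row) / t for row in rows]
    trt_means = [sum(row[i] for row in rows) / r for i in range(t)]

    ss_block = t * sum((m - grand_mean) ** 2 for m in block_means)
    ss_trt = r * sum((m - grand_mean) ** 2 for m in trt_means)
    ss_total = sum((y - grand_mean) ** 2 for row in rows for y in row)
    ss_err = ss_total - ss_block - ss_trt

    ms_err = ss_err / ((t - 1) * (r - 1))
    return {"SS_block": ss_block, "SS_trt": ss_trt, "SS_err": ss_err,
            "SS_total": ss_total, "MS_err": ms_err,
            "F_block": (ss_block / (r - 1)) / ms_err,
            "F_trt": (ss_trt / (t - 1)) / ms_err}


rcbd_table = rcbd_anova(heights)
print(rcbd_table)
```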

29

Module 1: Lecture 6

Field Designs and Heritability

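$y_{ijk} = \mu + E_i + \beta_{j(i)} + P_k + PE_{ik} + \varepsilon_{ijk}$

FACTORIAL DESIGNS (HS, FS, RIL, DH, Clones)

SoV             df             EMS
Environment     (e-1)
Block(Env)      e(r-1)
Progeny         (n-1)          $MS_{Progeny} = V_\varepsilon + rV_{PE} + reV_{Progeny}$
Progeny x Env   (n-1)(e-1)     $MS_{PE} = V_\varepsilon + rV_{PE}$
Pooled error    (n-1)(r-1)e    $MS_{Error} = V_\varepsilon$

HS: $V_{Progeny} = \frac{1+F}{4} V_A$
FS: $V_{Progeny} = \frac{1+F}{2} V_A + \left(\frac{1+F}{2}\right)^2 V_D$
RIL: $V_{Progeny} = 2 V_A$

$\hat{h}^2_F = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_p} = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_F + \frac{\hat{\sigma}^2_{FE}}{e} + \frac{\hat{\sigma}^2_e}{re} + \frac{\hat{\sigma}^2_w}{rne}}$

B Chapter 6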
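The expected mean squares above can be solved for the variance components by the method of moments. The sketch below illustrates this in Python under stated assumptions: the mean squares are hypothetical numbers, the function names are illustrative, and the heritability is on an entry-mean basis using the plot-level error only (the within-plot term is ignored).

```python
def variance_components(ms_progeny, ms_pe, ms_error, r, e):
    """Method-of-moments estimates from the EMS table.

    MS_Error   = V_e
    MS_PE      = V_e + r * V_PE
    MS_Progeny = V_e + r * V_PE + r * e * V_Progeny
    """
    v_error = ms_error
    v_pe = (ms_pe - ms_error) / r
    v_progeny = (ms_progeny - ms_pe) / (r * e)
    return v_progeny, v_pe, v_error


def entry_mean_heritability(v_progeny, v_pe, v_error, r, e):
    """Heritability of progeny means over e environments and r blocks."""
    return v_progeny / (v_progeny + v_pe / e + v_error / (r * e))


# Hypothetical mean squares for a trial with r = 2 blocks in e = 3 environments.
r, e = 2, 3
v_progeny, v_pe, v_error = variance_components(20.0, 8.0, 4.0, r, e)
h2 = entry_mean_heritability(v_progeny, v_pe, v_error, r, e)
print(v_progeny, v_pe, v_error, round(h2, 3))
```

With balanced data this heritability equals 1 - MS_PE / MS_Progeny, which is a quick check on the arithmetic. Moment estimates can be negative when a mean square is smaller than the one below it; this sketch does not truncate them.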


Number of treatments

HOW TO DEAL WITH HIGH NUMBER OF TREATMENTS?

1. STRATIFICATION: Group genotypes with similar characteristics (maturity, color, family) and compare within groups. NO BETWEEN-GROUP COMPARISONS.

2. PRODUCE HOMOGENEOUS EXPERIMENTAL UNITS: Make every effort to homogenize the experimental area (look for soil similarity, manage field conditions to reduce variation, choose seeds of similar vigor).

3. USE REPEATED CHECKS: Checks can be placed in a systematic way to control or model soil heterogeneity.

4. EXPERIMENTAL DESIGN WITH SPATIAL CONSIDERATIONS: Use experimental designs that accommodate a large number of treatments while controlling variability (e.g. alpha designs, unreplicated designs, etc.).


Other Considerations (Design)

1. PLOT SIZE:
a. Big enough to have plants growing normally (e.g. single-plant plots do not work in crops but work fine for trees).
b. Big enough to represent trait variation (e.g. you might need larger plots for yield than for maturity).
c. Not so big as to have considerable within-plot variation (i.e. increasing plot size increases within-plot variation).
d. Balance between the cost of increasing plot size and the cost of increasing the number of plots.

EXAMPLE. Mohammad et al. (2001) compared plot size and number of replications to detect differences in wheat. Bigger plots require fewer reps.

[Figure: plot size versus difference to be detected, with one curve per number of replications (r = 2, 3, 4, 5)]


2. PLOT SHAPE:
a. Balance between plot border and plants within the plot (i.e. rectangular plots have more border than square ones).
b. Take gradients into account (e.g. with a unidirectional gradient such as fertility, long and narrow plots might be better).

EXAMPLE. Haapanen (1992) compared plot size and shape for height in pines.



3. ROW vs. HILL PLOTS:

EXAMPLE. Tragoonrung et al., 1990 compared row and hill-plots for different traits in barley. Hill-plots are ok for most traits but not for yield.


4. BORDER ROW. It is a good idea to include a row of plants as a border, so that plots at the margins of the experiment do not perform differently due to lack of competition.

5. OPERATOR NOISE. If measurements will be taken by different people, it is a good idea to account for operator noise, for example by having each person measure a different block.

6. EARLY TESTING. Only small amounts of seed are available in early generations, so replication is a challenge. It is important to take this into account when deciding which traits will be measured early.

7. OTHER DESIGNS AND SPATIAL CORRELATIONS. More efficient designs exist for testing large numbers of genotypes in fields that might not be completely homogeneous. Additionally, spatial corrections might improve the estimation of genotypic means.


Other Considerations (Estimation)

1. MIXED MODELS. Expected Mean Squares (EMS) should be interpreted carefully, especially when using mixed models. The EMS differ depending on whether factors (environments, genotypes, etc.) are defined as random or fixed effects:

a. E, G, GxE random:
$EMS(G) = \sigma^2_e + n\sigma^2_I + nN\sigma^2_G$

b. G, GxE random, E fixed:
$EMS(G) = \sigma^2_e + nN\sigma^2_G$

within env:
$h^2_F = \frac{\sigma^2_G + \sigma^2_I}{\sigma^2_G + \sigma^2_I + \sigma^2_e}$

across env.:
$h^2_F = \frac{\sigma^2_G}{\sigma^2_G + \sigma^2_I + \sigma^2_E + \sigma^2_e}$

LW Chapter 22


Other Considerations (Estimation)

2. NON-BALANCED DESIGNS. When designs are not balanced, whether by construction or because of missing plots or missing data, MS calculations are not straightforward. There are four ways to estimate MS.

MIXED MODELS - INCOMPLETE BLOCKS - UNBALANCED

With unbalanced data, mixed models, or correlation among genotypes, the variance-component estimate of heritability is not directly related to the gain from selection. Using the concept of "effective error variance" (Cochran and Cox, 1957), Cullis et al. (2006) proposed a generalized measure of heritability:

$h^2 = 1 - \frac{\bar{\upsilon}_{BLUP}}{2\sigma^2_G}$

$\bar{\upsilon}_{BLUP}$: mean variance of a difference of two BLUPs

Piepho and Möhring, 2007

Module 1: Lecture 7

Lecture 7 QTL Mapping


Tucson Plant Breeding Institute

January 2018 Tucson, Arizona


Problem

QUANTITATIVE TRAIT: A phenotypic trait under polygenic control (coded by many genes, each usually of "small" effect) and subject to environmental influence.

How To Select For Quantitative Traits?
1. Traditional Breeding
2. Marker Assisted Selection
3. Genomic Selection

Identification of chromosome regions that affect quantitative traits

[Figure: chromosome with molecular markers and gene evidence]


Information needed

1. Molecular marker scores: High-throughput panels, controlled conditions, repeatable, cheap, automatic scoring.

2. Genetic map: More standard methods, small population sizes, consensus maps? Needs some more development.

3. Phenotypes: Crucial part; poor phenotypes mean poor QTL mapping.

Outline

1. Linkage

2. Types of Populations

3. Map construction using linkage (overview)

4. QTL mapping using linkage

o QTL mapping: 1. Single Marker Analysis

o QTL mapping: 2. Interval Mapping

o QTL mapping: 3. Composite Interval Mapping

5. Other issues:

o Multiple Testing

o Missing Data

o Epistasis

o QTLxE

o Polyploids

6. QTL estimation


Linkage and Mendel’s Laws

LAW OF SEGREGATION (MENDEL’S FIRST LAW):

o Every individual carries two copies of a gene (alleles)

o Each parent passes only one of its copies to an offspring

o Parents A1A1 or A2A2 produce only ‘A1’ or ‘A2’ gametes, respectively; heterozygotes produce ‘A1’ and ‘A2’ gametes in a 50/50 ratio.

LAW OF INDEPENDENT ASSORTMENT (SECOND LAW):

o Different genes segregate independently

o True in the absence of linkage

Wu, Ma, and Casella, 2010

Linkage and Recombination

Molecular marker: short DNA segment (or a single base in the case of SNPs).
Locus: point on a chromosome (loci in plural), e.g. markers 1 and 2.
Allele: gene variant. Yellow = maternal. Red = paternal.

[Figure: cross A1A1B1B1 x A2A2B2B2 giving the F1 A1A2B1B2, with Marker 1 (A) and Marker 2 (B) on the same chromosome]

The F1 produces two types of gametes:
- Parental (P): same combination as in the original lines (A1B1 or A2B2), each with frequency $(1-r)/2$
- Recombinant (R): one allele from each parent (A1B2 or A2B1), each with frequency $r/2$

Define r as the recombination frequency: r = # recombinants / total

If there is linkage there are more parental than recombinant gametes:
– r closer to 0 = strong linkage
– r closer to 0.5 = independence

Wu, Ma, and Casella, 2010


Linkage and Recombination

• The higher the recombination between two loci, the greater the genetic distance between them.

• If the loci are independent, all four gametes are equally frequent, at 0.25 each (maximum recombination = 0.5).

• Mapping function: relates the recombination frequency r to the genetic (map) distance.

• Therefore recombination ratios (under certain circumstances) can be used to determine genetic distances (among loci, among markers, or between markers and QTL).
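The formula for the mapping function did not survive on this slide. As an illustration only, the sketch below implements two widely used mapping functions, Haldane (no crossover interference) and Kosambi (some interference); which one the lecture intended is an assumption. Distances are in Morgans (1 Morgan = 100 cM), and the function names are illustrative.

```python
import math


def haldane_r(d):
    """Recombination frequency from map distance d (Morgans), Haldane."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def haldane_d(r):
    """Map distance (Morgans) from recombination frequency r < 0.5, Haldane."""
    return -0.5 * math.log(1.0 - 2.0 * r)


def kosambi_r(d):
    """Recombination frequency from map distance d (Morgans), Kosambi."""
    return 0.5 * math.tanh(2.0 * d)


def kosambi_d(r):
    """Map distance (Morgans) from recombination frequency r < 0.5, Kosambi."""
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


for cm in (1, 10, 50, 200):
    d = cm / 100.0
    print(cm, "cM:", round(haldane_r(d), 4), round(kosambi_r(d), 4))
```

For short distances both functions give r close to d; for long distances r levels off at 0.5, which is why recombination frequencies are not additive along a chromosome while map distances are.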



QTL Mapping

KEY IDEA:

If a molecular marker is “associated” to the phenotype (i.e. the mean trait value for individuals with marker state MM is different from the mean trait value of individuals with marker state mm), then the marker is linked to a QTL.



Populations

We need genetically diverse populations!

There are two options:

1. Design a population with known recombination events.
o Types of populations: RIL, DH, F2, BC, etc.
o Linkage mapping (also known as: “Traditional QTL Mapping”, “Bi-parental QTL Mapping”, “Balanced population QTL Mapping”, “QTL Mapping”, etc.).

2. Use existing diverse populations.
o LD related to distance + OTHER causes.
o Need to account for other causes of LD.
o Association Mapping (also known as: “Linkage Disequilibrium Mapping”, “LD Mapping”, “GPD Mapping”, “GWAS”, etc.).

Designed Populations – F2

F2 population

[Figure: parents P1 (A1) and P2 (A2) are crossed to give the F1, which is selfed to give the F2]

The F1 is selfed one time.
All 3 possible genotypes are present: A1A1, A1A2, and A2A2.
Short ‘history’ of recombination.
Allows additive and dominance effects to be distinguished.


Designed Populations – BC

Back-cross population (BC)

The F1 is “back” crossed to one of the parents.
The BC lines carry:
o a full chromosome from the recurrent parent
o a chromosome that is a mosaic of the two parents
Possible genotypes: A1A1 or A1A2 (A2A2 is not present).
Short ‘history’ of recombination.

[Figure: P1 (A1) x P2 (A2), with the F1 back-crossed to P1]

Designed Populations - DH

Doubled-Haploid Population (DH):

[Figure: P1 (A1) x P2 (A2) giving the F1; F1 gametes are doubled to give DH lines]

F1 gametes are duplicated.
Completely homozygous individuals (only A1A1 and A2A2 genotypes possible).
One generation of recombination.


Designed Populations - RIL

Recombinant Inbred Lines (RIL):

The F2 individuals are selfed for several generations.
Heterozygosity is halved each generation.
More generations mean more recombination.

[Figure: P1 (A1) x P2 (A2)]

Designed Populations – MP

Multiparent or 4-way Cross Population (MP):



Map construction

STUDY LINKAGE AMONG MARKERS

[Figure: recombination between pairs of markers]

Markers 1, 5 and 10 are linked.
Markers 1 and 5 are more closely linked to each other than either is to marker 10.
Markers 2 and 7 are also linked.


Mapping Traits (Qualitative):

[Figure: individuals with their marker genotypes and qualitative trait scores (R or S)]

Two-point analysis

Mapping Traits (Quantitative):

[Figure: individuals with their marker genotypes and quantitative trait values (ranging from about 92 to 129)]

How to test for an association?


Mapping Traits (Quantitative)

[Figure: F2 population. Histograms of trait values (x axis from 35 to 65, y axis frequency) and trait values by marker genotype (A1A1, A1A2, A2A2) for two simulated cases: n=500, a=8, d=0, mu=50, sd=2 and n=500, a=0, d=0, mu=50, sd=2]

Phenotypic distribution is a mixture distribution.

How do we test for an association?
• Linear models: t-test, ANOVA (F-test), regression.
• Maximum Likelihood: LRT.


Assume:
• a co-dominant marker (M)
• a QTL (Q)
• recombination frequency r between M and Q

Cross: MMQQ x mmqq, giving the F1 MmQq.

Genotypic values:
• $P + a$ for QQ
• $P + d$ for Qq
• $P - a$ for qq

F1 Gametes: Pr(MQ) = ½ (1-r) = Pr(mq); Pr(Mq) = ½ r = Pr(mQ)

F2 Population

Marker                     Conditional frequency
Genotype   Frequency       QQ          Qq            qq
MM         ¼               (1-r)^2     2r(1-r)       r^2
Mm         ½               r(1-r)      1-2r+2r^2     r(1-r)
mm         ¼               r^2         2r(1-r)       (1-r)^2
Value of F2 individuals    P + a       P + d         P - a

BUT THE QTL IS NOT ON TOP OF THE OBSERVED MARKER
So we use conditional probabilities of QTL genotypes:

$\Pr(Q_k \mid M_j) = \frac{\Pr(Q_k M_j)}{\Pr(M_j)}$

F2 Individuals:

$\Pr(QQ \mid MM) = \frac{\Pr(QQMM)}{\Pr(MM)} = \frac{\frac{1}{2}(1-r) \cdot \frac{1}{2}(1-r)}{\frac{1}{4}} = (1-r)^2$

Bernardo, 2010


F2 Population

Marker                     Conditional frequency
Genotype   Frequency       QQ          Qq            qq
MM         ¼               (1-r)^2     2r(1-r)       r^2
Mm         ½               r(1-r)      1-2r+2r^2     r(1-r)
mm         ¼               r^2         2r(1-r)       (1-r)^2
Value of F2 individuals    P + a       P + d         P - a

Mean of individuals:
$\overline{MM}_{F_2} = P + a(1-2r) + 2r(1-r)d$
$\overline{Mm}_{F_2} = P + d(1 - 2r + 2r^2)$
$\overline{mm}_{F_2} = P - a(1-2r) + 2r(1-r)d$

Differences:
$(\overline{MM} - \overline{mm})_{F_2} = 2a(1-2r)$
$\left(\overline{Mm} - \frac{\overline{MM} + \overline{mm}}{2}\right)_{F_2} = d(1-2r)^2$

ASPECTS TO NOTE:
1. QTL effect and position are confounded (i.e. the same mean difference could be produced by a tightly linked QTL of small effect or by a loosely linked QTL of large effect).
2. In an F2 it is possible to estimate additive and dominance effects.
3. Both a* and d* are underestimated by a fraction that depends on the unknown r.

Bernardo, 2010
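The marker-class means follow directly from weighting the QTL genotypic values by the conditional frequencies in the table. The sketch below does this numerically in Python; the parameter values (P = 50, a = 8, d = 2, r = 0.1) and the function names are hypothetical and chosen only for illustration.

```python
def f2_conditional_probs(r):
    """Pr(QQ), Pr(Qq), Pr(qq) given each marker genotype in an F2."""
    return {
        "MM": ((1 - r) ** 2, 2 * r * (1 - r), r ** 2),
        "Mm": (r * (1 - r), 1 - 2 * r + 2 * r ** 2, r * (1 - r)),
        "mm": (r ** 2, 2 * r * (1 - r), (1 - r) ** 2),
    }


def f2_marker_means(P, a, d, r):
    """Expected trait mean of each marker class in an F2."""
    values = (P + a, P + d, P - a)   # genotypic values of QQ, Qq, qq
    probs = f2_conditional_probs(r)
    return {
        marker: sum(p * v for p, v in zip(row, values))
        for marker, row in probs.items()
    }


P, a, d, r = 50.0, 8.0, 2.0, 0.1
means = f2_marker_means(P, a, d, r)
additive_contrast = means["MM"] - means["mm"]
dominance_contrast = means["Mm"] - (means["MM"] + means["mm"]) / 2
print(means)
print(additive_contrast, 2 * a * (1 - 2 * r))
print(dominance_contrast, d * (1 - 2 * r) ** 2)
```

The two contrasts match 2a(1-2r) and d(1-2r)^2. Rerunning with a smaller a and a smaller r (or a larger a and a larger r) can give the same additive contrast, which is the confounding of effect and position noted above.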



Mapping Traits (Quantitative)

F2 POPULATION

F2 Individuals: MMQQ, MMQq, MMqq, MmQQ, MmQq, Mmqq, mmQQ, mmQq, mmqq

Mean of individuals:
$\overline{MM}_{F_2} = P + a(1-2r) + 2r(1-r)d$
$\overline{Mm}_{F_2} = P + d(1 - 2r + 2r^2)$
$\overline{mm}_{F_2} = P - a(1-2r) + 2r(1-r)d$

Differences:
$(\overline{MM} - \overline{mm})_{F_2} = 2a(1-2r)$
$\left(\overline{Mm} - \frac{\overline{MM} + \overline{mm}}{2}\right)_{F_2} = d(1-2r)^2$

BC POPULATION

P2 Gametes: Pr(mq) = 1

BC Individuals: MmQq, Mmqq, mmQq, mmqq

$\Pr(Qq \mid Mm) = \frac{\Pr(QqMm)}{\Pr(Mm)} = \frac{\frac{1}{2}(1-r) \cdot 1}{\frac{1}{2}} = (1-r)$

Mean of individuals:
$\overline{Mm}_{BC} = P + d(1-r) - ar$
$\overline{mm}_{BC} = P + dr - a(1-r)$

Differences:
$(\overline{Mm} - \overline{mm})_{BC} = (a+d)(1-2r)$

Bernardo, 2010



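DH POPULATION

DH Individuals: Pr(MMQQ) = ½ (1-r) = Pr(mmqq); Pr(MMqq) = ½ r = Pr(mmQQ)

$\Pr(QQ \mid MM) = \frac{\Pr(QQMM)}{\Pr(MM)} = \frac{\frac{1}{2}(1-r)}{\frac{1}{2}} = (1-r)$

Mean of individuals:
$\overline{MM}_{DH} = P + a(1-2r)$
$\overline{mm}_{DH} = P - a(1-2r)$

Differences:
$(\overline{MM} - \overline{mm})_{DH} = 2a(1-2r)$

RIL POPULATION

Mean of individuals:
$\overline{MM}_{RIL} = P + a(1-2R)$
$\overline{mm}_{RIL} = P - a(1-2R)$

Differences:
$(\overline{MM} - \overline{mm})_{RIL} = 2a(1-2R)$

$R = \frac{2r}{1+2r}$ for RIL from selfing
$R = \frac{4r}{1+6r}$ for RIL from sib-mating

Bernardo, 2010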
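To compare how much of the QTL effect each population type recovers at a linked marker, the expected marker contrasts above can be evaluated side by side. This is a minimal Python sketch; the function names and the example values (a = 8, d = 2, r = 0.1) are hypothetical.

```python
def ril_recombination(r, mating="selfing"):
    """Proportion of recombinant RILs (R) from the per-meiosis frequency r."""
    if mating == "selfing":
        return 2 * r / (1 + 2 * r)
    if mating == "sib":
        return 4 * r / (1 + 6 * r)
    raise ValueError("mating must be 'selfing' or 'sib'")


def expected_marker_contrast(population, a, d, r):
    """Expected difference between marker classes for each population type.

    F2, DH and RIL: MM minus mm. BC: Mm minus mm.
    """
    if population in ("F2", "DH"):
        return 2 * a * (1 - 2 * r)
    if population == "BC":
        return (a + d) * (1 - 2 * r)
    if population == "RIL":
        big_r = ril_recombination(r, "selfing")
        return 2 * a * (1 - 2 * big_r)
    raise ValueError("unknown population type")


a, d, r = 8.0, 2.0, 0.1
for pop in ("F2", "BC", "DH", "RIL"):
    print(pop, round(expected_marker_contrast(pop, a, d, r), 3))
```

Because R is larger than r for any 0 < r < 0.5, the marker contrast in RILs shrinks faster with distance than in an F2 or DH population: the extra generations of recombination separate marker and QTL more often.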



How to map QTL?

Steps for mapping QTL through LINKAGE:

1. Create a designed population.

2. Collect genotypic information on parents and offspring in the form of molecular marker scores.

3. Look for linkage between marker loci.

4. Construct a genetic map.

5. Detect quantitative trait loci (testing for association between a phenotypic trait and a marker).

o Qualitative trait: two-point linkage test.

o Quantitative trait: linear models (t-test, ANOVA, marker-regression) or maximum likelihood tests (LRT). Use single marker analysis, interval mapping or composite interval mapping.


QTL mapping: 1. Single Marker Analysis

SINGLE MARKER ANALYSIS (SMA), MARKER REGRESSION (MR):

IDEA: If there is a significant association between a molecular marker and a quantitative trait, then it is possible that a QTL exists close to that marker. One marker at a time is tested through a linear model (i.e. t-test, ANOVA or regression), or using the full density function for the mixture distribution.

WHEN TO USE IT? To look at the data roughly and to study missing data patterns. OK if you are not interested in estimating QTL position or QTL effects.

Linear Model:
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

Maximum Likelihood:
$L(\mu, \sigma^2) = \prod_i \sum_j p_{ij}\, \phi(y_i; \mu_j, \sigma^2)$


[Figure: linkage map of one chromosome with marker positions in cM, and F2 histograms of trait values by marker genotype (A1A1, A1A2, A2A2). Broman and Sen, 2009]

QTL mapping: 1. Single Marker Analysis

LINEAR MODELS. Use the difference in mean trait value for different marker genotypes to detect a QTL and to estimate its effects.

One-way ANOVA:
$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where $y_{ij}$ is the value of the trait in the j-th individual with marker genotype i, and $\alpha_i$ is the effect of marker genotype i on the trait value.

Regression:
$y_{ij} = \beta_0 + \beta_1 x_i + \varepsilon_{ij}$
where $y_{ij}$ is the value of the trait in the j-th individual with marker genotype i, $x_i$ is marker genotype i, and $\beta_1$ is the effect of the marker genotype on the trait value.

p-values are used for profile plots
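The regression form of single marker analysis amounts to an ordinary least-squares fit of the trait on a numeric marker score. The sketch below is a plain Python illustration, not the lecture's software: the marker coding (-1, 0, 1 for A1A1, A1A2, A2A2), the six data points and the function name are all hypothetical.

```python
import math


def single_marker_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, t statistic)."""
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                     # effect of the marker genotype
    b0 = ybar - b1 * xbar              # intercept
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b0, b1, b1 / se_b1


# Hypothetical marker scores and trait values for six individuals.
marker = [-1, -1, 0, 0, 1, 1]
trait = [42.0, 44.0, 49.0, 51.0, 57.0, 59.0]
b0, b1, t_stat = single_marker_regression(marker, trait)
print(round(b0, 3), round(b1, 3), round(t_stat, 2))
```

In a genome scan this fit is repeated for every marker, and the p-value (from a t distribution with n - 2 degrees of freedom) or its -log10 is plotted against map position.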


QTL mapping: 1. Single Marker Analysis

MAXIMUM LIKELIHOOD METHODS. Use the entire distribution of the data. They are more powerful than linear models but not as flexible.

ML Function:
$\ell(z \mid M_j) = \sum_{k=1}^{N} \varphi(z, \mu_{Q_k}, \sigma^2)\, \Pr(Q_k \mid M_j)$
where $z \mid M_j$ is the trait value given the marker genotype, $\varphi$ is the density function for a normal distribution of the trait value given the QTL genotype (with mean $\mu_{Q_k}$), and $\Pr(Q_k \mid M_j)$ is the conditional probability of the QTL genotype given the marker genotype.

Test:
$LR = -2 \ln\left[\frac{\max \ell_r(z)}{\max \ell(z)}\right]$

$LOD(c) = -\log_{10}\left[\frac{\max \ell_r(z)}{\max \ell(z)}\right] \cong \frac{LR(c)}{4.61}$
where $\max \ell_r(z)$ is the maximum likelihood under no QTL and $\max \ell(z)$ is the maximum likelihood under the full model.

LOD scores are used for profile plots
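The constant 4.61 in the LOD formula is 2 ln(10), since the LR uses natural logarithms (times 2) and the LOD uses base-10 logarithms. A short Python sketch of the conversion, with hypothetical log-likelihood values and illustrative function names:

```python
import math


def likelihood_ratio(loglik_full, loglik_reduced):
    """LR = -2 ln(L_reduced / L_full), from natural-log likelihoods."""
    return 2.0 * (loglik_full - loglik_reduced)


def lod_from_lr(lr):
    """LOD = LR / (2 ln 10), i.e. approximately LR / 4.61."""
    return lr / (2.0 * math.log(10.0))


# Hypothetical maximized log-likelihoods (natural log) at one position.
loglik_full = -1204.3      # model with a QTL
loglik_reduced = -1211.2   # model without a QTL
lr = likelihood_ratio(loglik_full, loglik_reduced)
print(round(lr, 2), round(lod_from_lr(lr), 2))
```

A LOD of 3, for example, means the data are 1000 times more likely under the QTL model than under the no-QTL model, and corresponds to an LR of about 13.8.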


QTL mapping: 1. Single Marker Analysis

[Figure: genome-wide profile plot of LOD score or –log(p-value)]


Single Marker Analysis (+):

• Simple

• No specific software requirement

• No chromosome position requirement

• No estimation of QTL position

Single Marker Analysis (-):

• Single QTL model

• Confounding of QTL estimation and position

• Loss of power due to residual variance caused by other QTL

• n tests



Issue 1: Multiple Testing



A common practice is to use P<0.05 to decide about the significance of a test. But with a large number of tests the chance of having at least one false positive is almost 1...

OTHER OPTIONS
1. Bonferroni multiple-testing protection:
   a. P = 0.05 / number of tests (this is very conservative)
   b. P = 0.05 / effective number of tests, with the effective number of tests estimated from the marker data (Li and Ji, 2005).

2. Permutations (Broman and Sen, 2009)

3. False Discovery Rate (Benjamini and Hochberg, 1995)


Issue 2: Missing data problem

Can we say something about unobserved positions?

Yes. If we know the recombination history of the population, we can use information from neighbouring markers to calculate the conditional probabilities of QTL genotypes:

$\Pr(Q_k \mid M_j) = \frac{\Pr(Q_k M_j)}{\Pr(M_j)}$

FILLING INFORMATION

o We only have information at specific positions (marker positions). We might be interested in making inferences between markers.

o Marker information can be incomplete (missing values) or partially informative (dominant markers).

Broman and Sen, 2009

Issue 2: Missing data problem

1. QTL genotype probabilities can be computed at any point on the genome using observed markers (Markov chain methods, Jiang and Zeng, 1997).

2. Missing values in markers can be imputed (so no need to exclude genotypes with missing marker data).

3. The conditional QTL probabilities are at the core of QTL mapping methods (but that is taken care of by the software).


QTL mapping: 2. Interval Mapping

SIMPLE INTERVAL MAPPING (SIM):

IDEA: Information on adjacent markers can be used to improve estimations. You may either use the genotypic predictors (pseudo-markers) or the conditional probabilities directly to scan for significant associations one marker (or pseudo-marker) at a time, or use the full likelihood function for the interval.

WHEN TO USE IT? To fill in missing marker information or to enrich marker data information. OK if background QTL are not important.

Linear Model:
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

Maximum Likelihood:
$L(\mu, \sigma^2) = \prod_i \sum_j p_{ij}\, \phi(y_i; \mu_j, \sigma^2)$

[Figure: linkage map of one chromosome with marker positions in cM]

Uses contiguous marker information to improve the estimation of marker effects:

Marker 1 (M1), QTL (Q), Marker 2 (M2), with recombination frequencies c1 (between M1 and Q), c2 (between Q and M2) and c12 (between M1 and M2).

Probability of the genotypes:
$\Pr(M_1M_1QQM_2M_2) = \left[\frac{(1-c_1)(1-c_2)}{2}\right]^2$
$\Pr(M_1M_1QqM_2M_2) = 2\left[\frac{(1-c_1)(1-c_2)}{2}\right]\left[\frac{c_1c_2}{2}\right]$
$\Pr(M_1M_1qqM_2M_2) = \left[\frac{c_1c_2}{2}\right]^2$

Conditional probability of the QTL:
$\Pr(QQ \mid M_1M_1M_2M_2) = \frac{(1-c_1)^2(1-c_2)^2}{(1-c_{12})^2}$
$\Pr(Qq \mid M_1M_1M_2M_2) = \frac{2c_1c_2(1-c_1)(1-c_2)}{(1-c_{12})^2}$
$\Pr(qq \mid M_1M_1M_2M_2) = \frac{c_1^2c_2^2}{(1-c_{12})^2}$

Maximum Likelihood Methods: Use the likelihood (L) function and the conditional probabilities inside the interval defined by two markers to determine the most likely position of the QTL inside the interval.


[Figure: the same linkage map, with evaluation positions (pseudo-markers c.10, c.20, ..., c.170) placed along the chromosome between markers]

Haley-Knott Regression: Uses the conditional probabilities calculated inside the interval defined by two markers directly as pseudo-markers and performs a regression at each position.
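As an illustration of how a pseudo-marker is built, the sketch below computes the conditional QTL genotype probabilities for an F2 individual that is M1M1 M2M2 at the flanking markers, and turns them into an expected genotype score that a Haley-Knott style regression could use as its predictor. It assumes no crossover interference, so that c12 = c1 + c2 - 2 c1 c2; that assumption, the -1/0/1 coding and the function names are illustrative choices, not taken from the slides.

```python
def interval_qtl_probs(c1, c2):
    """Pr(QQ), Pr(Qq), Pr(qq) given flanking genotype M1M1 M2M2 in an F2.

    c1: recombination frequency between marker 1 and the QTL position
    c2: recombination frequency between the QTL position and marker 2
    Assumes no interference: c12 = c1 + c2 - 2*c1*c2.
    """
    c12 = c1 + c2 - 2 * c1 * c2
    denom = (1 - c12) ** 2
    p_QQ = ((1 - c1) * (1 - c2)) ** 2 / denom
    p_Qq = 2 * c1 * c2 * (1 - c1) * (1 - c2) / denom
    p_qq = (c1 * c2) ** 2 / denom
    return p_QQ, p_Qq, p_qq


def additive_predictor(probs):
    """Expected additive score with QQ = 1, Qq = 0, qq = -1."""
    p_QQ, _, p_qq = probs
    return p_QQ - p_qq


probs = interval_qtl_probs(0.10, 0.10)
print([round(p, 4) for p in probs], round(additive_predictor(probs), 4))
```

Repeating this for every flanking-marker genotype and every evaluation position gives one predictor column per position; regressing the trait on each column in turn produces the interval-mapping profile.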


Simple Interval Mapping (+):

• Evaluation at and between markers

• Estimation of QTL position

• No specialized software (only for conditional distribution calculations)

Simple Interval Mapping (-):

• Single QTL model

• Loss of power due to residual variance caused by other QTL

• n-1* tests

With high marker density it is very similar to marker regression




QTL mapping

We have fitted individual (virtual) markers to phenotypic responses: single QTL models (Single Marker Analysis and Simple Interval Mapping). However, we expect a particular phenotypic trait to be affected by a number of QTL simultaneously.

When testing individual markers, tests for QTL will have diminished power because of QTL segregating at positions other than the evaluation position.

What strategy to follow to arrive at multi-QTL models?

Use markers earlier identified as close to or coinciding with significant QTL/genes as covariables (co-factors) in subsequent genome-wide scans.

QTL mapping: 3. Composite Interval Mapping

COMPOSITE INTERVAL MAPPING (CIM):

IDEA: On top of using contiguous marker information, use background loci to get a better estimation of QTL effects. MR and SIM provide biased estimates when multiple QTL are close to a marker, and have less power in general. The problem is how to select the cofactors.

WHEN TO USE IT? It is the preferred method because it has more power and decreases bias due to contiguous QTL.


[Figure: linkage map with evaluation positions (c.10 to c.170) and two cofactor markers, each surrounded by an exclusion window (windows 1 and 2)]

Uses markers as cofactors to account for the genetic background.

No cofactor is allowed within a window of specified size around the evaluation position, to avoid overfitting.

Conditional probabilities in-between markers are still used to improve estimations.

Outside both windows:
$y_i = \mu + M_i + C_1 + C_2 + \varepsilon_i$

Inside window 1:
$y_i = \mu + M_i + C_2 + \varepsilon_i$

Inside window 2:
$y_i = \mu + M_i + C_1 + \varepsilon_i$

Broman and Sen, 2009

[Figure: CIM profiles for two window sizes: cofactors excluded when 50 cM or less apart, and cofactors excluded when ONLY 5 cM or less apart]

Careful with interpretation. Sharp peaks are caused by the window size, not by a higher precision of QTL location.


Composite Interval Mapping (+):

• Evaluation at and between markers

• Estimation of QTL position

• Control of genetic background interactions

• Multiple QTL screened

Composite Interval Mapping (-):

• How to select marker-cofactors?



Power and Repeatability:
1. Underestimation of the number of QTL
2. Over-estimation of effects

Beavis effect

Bernardo, 2010


QTL LOCATION (point estimate)
o After the significant genetic predictors have been identified, QTL are typically located at the position of the maximum value of the test statistic within a chromosome region where all genetic predictors exceed the threshold for significance.
o This final identification of QTL using LOD, t, F, Wald, deviance or –log10(p) values is not trivial, due to the irregularity of many test statistic profiles.

QTL EFFECTS (point and interval estimates)
o Given that QTL and their locations have been identified, point and interval estimates for QTL effects can be obtained by fitting a linear (mixed) model with all identified QTL.
o Some clean-up procedure may be necessary after genome scans to arrive at a final acceptable multi-QTL model.
o 95% CI (F2 and BC), in cM = 530 / (number of individuals x fraction of phenotypic variance explained) (Darvasi and Soller, 1997).
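The Darvasi and Soller (1997) approximation is easy to evaluate for planning purposes. The Python sketch below applies it to a few hypothetical population sizes and explained-variance fractions; the function name is illustrative and the result is interpreted as a width in cM.

```python
def darvasi_soller_ci(n_individuals, variance_explained):
    """Approximate 95% confidence interval (cM) for QTL position in F2 or BC.

    variance_explained is the fraction (0 to 1) of phenotypic variance
    explained by the QTL.
    """
    if n_individuals <= 0 or not 0 < variance_explained <= 1:
        raise ValueError("need n > 0 and 0 < variance_explained <= 1")
    return 530.0 / (n_individuals * variance_explained)


for n, v in [(100, 0.05), (200, 0.10), (500, 0.10), (1000, 0.20)]:
    print(n, v, round(darvasi_soller_ci(n, v), 1), "cM")
```

The interval depends only on the product of population size and effect size, so a QTL explaining 5% of the variance needs twice as many individuals as one explaining 10% to be located with the same precision.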

QTL estimation

EXPLAINED PHENOTYPIC VARIATION BY QTL
o Extra sums of squares / partial r2.
o Comparing (residual) genetic variances between models.
o From expressions in quantitative genetics and standard mixed model applications.


Issue 3: Epistasis

Epistasis might be important for some traits. There are several options for detecting epistasis:

Linear models. Include an interaction term in any of the linear models discussed earlier and test it for significance. The problem is that the large number of potential interaction terms leads to supersaturated models (i.e. there are more parameters (p) to estimate than there are independent samples (n)). One possibility is to use a forward selection approach, adding interaction terms one at a time and comparing the models.

Bayesian approaches. You may fit a model including all possible terms, assuming that parameters are either close to zero or have a large value.

Issue 4: Polyploids

TYPE OF POLYPLOIDS

o Allopolyploids (i.e. chromosome sets derived from different species, as in wheat), in which chromosomes pair at meiosis according to their ancestral genomes, can be mapped as diploid species.

o Autopolyploids (i.e. chromosome sets derived from the same ancestral species) may have bivalent (i.e. two-chromosome) or multivalent pairing during meiosis. Identifying alleles, number of allele copies, and linkage phase is a challenge.

o Bivalent: only two chromosomes pair during meiosis, with one chromosome segregating to each gamete.

o Multivalent: multiple chromosomes pair during meiosis, resulting in gametes with different combinations of the chromosomes.


Issue 4: Polyploids

MAPPING ALTERNATIVES

o Diploid relatives. Use diploid species that are related to the polyploids of interest. However, polyploidization is highly dynamic, not all polyploids of interest have diploid relatives, and breeding of polyploid species is usually managed at the polyploid level.

o Single dose restriction fragments (Ritter et al., 1990). Use simplex parents (Mmmm) to produce gametes that segregate in a 1:1 ratio (Wu et al., 1992). Could be used with dominant or co-dominant markers.

o Multiple dose markers (Ripol et al., 1999).

o Doerge and Craig (2000) assumed complete preferential pairing (ok for allopolyploids).

o Hackett et al. (2001) assumed random chromosome pairing (ok for extreme autopolyploids).

o Ma et al. (2002) and Wu et al. (2004) incorporated a preferential pairing factor that depends on chromosome similarity.

o Cao et al. (2005) extended it to any even ploidy level.

Issue 5: GxE Interaction

[Figure: responses of Genotype 1 and Genotype 2 in environments ENV 1 and ENV 2, and the corresponding plot of G1 and G2 across E1 and E2]

Why QTLxE?

o Use appropriate residuals to test effects

o To identify main-effect QTL and environment-specific QTL

Module 1: Lecture 7b

Multiple Testing in QTL Mapping

Tucson Winter Plant Breeding Institute

Multiple Testing

[Figure: cartoon about multiple testing]

WHAT IS WRONG WITH THE PREVIOUS CARTOON?
A common practice is to use P<0.05 to decide about the significance of a test. But with a large number of tests (as is common in QTL mapping, where we perform one test at each marker), the chance of having at least one false positive approaches 1 pretty quickly.

Let’s look at some theory to clarify this concept.

Hypothesis Testing (in QTL context)

OUTCOMES OF A STATISTICAL TEST

FALSE POSITIVE: occurs when a QTL is incorrectly declared present.
TRUE POSITIVE: occurs when a QTL is correctly declared present.

FALSE NEGATIVE: occurs when a QTL is incorrectly declared absent.
TRUE NEGATIVE: occurs when a QTL is correctly declared absent.

                      “THE TRUTH”
RESULT FROM TEST      H0 is True (i.e. no QTL)    H0 is False (i.e. a QTL)
Reject H0             False Positive              True Positive
Do not reject H0      True Negative               False Negative

B Chapter 5

Type I Error

Type I error: the incorrect rejection of a true null hypothesis (an incorrectly declared QTL) in a given test.
Probability of Type I error (α): the probability of finding a false positive (an incorrectly declared QTL) in a given test.

B Chapter 5

Type II Error

Type II error: the failure to reject a false null hypothesis (there is a QTL and we fail to declare it) in a given test.
Probability of Type II error (β): the probability of finding a false negative (failure to declare a true QTL) in a given test.

B Chapter 5

Type I and II Error

NOTE that for a given test, decreasing the false positive rate means that the power (the probability of detecting a true positive, or 1-β) is also decreased. The only way to decrease false positives and increase power at the same time is to change the design (i.e. increasing the population size, reducing experimental error, etc.).

B Chapter 5

Multiple Testing

Let's now assume that we are not interested in testing a single hypothesis, but are conducting multiple hypothesis tests. For example, we perform one hypothesis test at each marker (at each marker we ask whether the marker is associated with a QTL or not). We can summarize the number of hypotheses that fall into each category in the following chart:

                  Decision
“Truth”           Do not reject H0    Reject H0    Total
H0 is true        U                   V            m0
H0 is false       Z                   S            m1 = m - m0
Total             m - R               R            m

Benjamini and Hochberg, 1995

Multiple Testing

Some more definitions:
Per-family error rate (PFER): expected number of false positives: E(V).
Per-comparison error rate (PCER): expected proportion of false positives among all tests: E(V)/m.
Family-wise error rate (FWER): probability of at least one false positive: P(V≥1).
False Discovery Rate (FDR): expected proportion of false positives among the rejected hypotheses: E(V/R).

We could be interested in controlling the probability of having at least one false positive test (using family-wise control). Or we could be interested in controlling the proportion of false discoveries among the rejected tests (using FDR).

Multiple Testing – Family control

If we aim to control FWER, it is necessary to use a somewhat smaller alpha value. But how small should it be? There are many methods that aim to control FWER. We will discuss two common methods that have been used in QTL mapping.

Bonferroni: Since the increase in the error is related to the number of independent tests, Bonferroni proposed to use a new alpha value as threshold:

$\alpha^* = 1 - (1-\alpha)^{1/m} \cong \frac{\alpha}{m}$

However, many studies have shown that a Bonferroni correction for QTL studies is overly conservative, mainly because tests are not independent (i.e. markers are linked and therefore not independent). An unnecessarily stringent threshold reduces the power to detect true QTL.

Benjamini and Hochberg, 1995; Li and Ji, 2005
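The relation between the per-test alpha, the number of tests and the family-wise error rate can be checked numerically. The Python sketch below assumes independent tests; the function names are illustrative.

```python
def family_wise_error_rate(alpha, m):
    """P(at least one false positive) for m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def corrected_alpha_exact(alpha, m):
    """Per-test threshold alpha* = 1 - (1 - alpha)^(1/m)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def corrected_alpha_bonferroni(alpha, m):
    """Per-test threshold alpha* = alpha / m (the usual approximation)."""
    return alpha / m


for m in (1, 10, 100, 1000):
    print(
        m,
        round(family_wise_error_rate(0.05, m), 4),
        corrected_alpha_exact(0.05, m),
        corrected_alpha_bonferroni(0.05, m),
    )
```

With 100 independent tests at P<0.05 the family-wise error rate is already above 0.99, and the two corrected thresholds are nearly identical, which is why alpha/m is the form normally used.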


Li and Ji (2005): An alternative is to use a Bonferroni correction but, instead of the total number of tests (which we know are not independent because markers are correlated), to use the effective number of independent tests. This idea was first proposed by Cheverud (2001) and then modified by Li and Ji (2005). The steps are:
1. Calculate the correlation matrix of markers.
2. Use the eigenvalues (λi) of the correlation matrix to determine the effective number of independent tests (Meff) as:

$M_{eff} = \sum_{i=1}^{M} f(|\lambda_i|)$
$f(x) = I(x \ge 1) + (x - \lfloor x \rfloor), \quad x \ge 0$

3. Use a Bonferroni-type correction to determine the threshold, but using the effective number of independent tests instead of the total number of tests:

$\alpha^* = 1 - (1-\alpha)^{1/M_{eff}} \cong \frac{\alpha}{M_{eff}}$

Li and Ji, 2005
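Step 2 can be sketched in a few lines once the eigenvalues are available. The Python example below takes the eigenvalues as given (computing them from the marker correlation matrix needs a linear algebra routine, which is outside this sketch); the eigenvalue lists are hypothetical and the function names are illustrative.

```python
import math


def effective_number_of_tests(eigenvalues):
    """Meff = sum of f(|lambda_i|), with f(x) = I(x >= 1) + (x - floor(x))."""
    total = 0.0
    for lam in eigenvalues:
        x = abs(lam)
        indicator = 1.0 if x >= 1.0 else 0.0
        total += indicator + (x - math.floor(x))
    return total


def meff_threshold(alpha, eigenvalues):
    """Bonferroni-type threshold alpha / Meff."""
    return alpha / effective_number_of_tests(eigenvalues)


# Four uncorrelated markers: every eigenvalue equals 1, so Meff = 4.
print(effective_number_of_tests([1.0, 1.0, 1.0, 1.0]))
# Four perfectly correlated markers: one eigenvalue of 4, so Meff = 1.
print(effective_number_of_tests([4.0, 0.0, 0.0, 0.0]))
# Hypothetical intermediate case.
print(effective_number_of_tests([2.5, 1.0, 0.4, 0.1]))
```

The two limiting cases show the intent of the method: independent markers count as one test each, while fully redundant markers count as a single test.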



Li and Ji (2005): This method has been shown to be better than the Bonferroni correction and the Cheverud (2001) method. It performs as well as permutation but is fast and simple to carry out.

Li and Ji, 2005

Multiple Testing – False Discovery

The False Discovery Rate (FDR) is the expected proportion of false positives amongst the rejected hypotheses: E(V/R).

We are looking at the problem from a different perspective: out of the total tests that we reject, how many are true positives?

                 Decision
                 Do not reject H0       Reject H0             Total
H0 is true       U (True Negative)      V (False Positive)    m0
H0 is false      (False Negative)       (True Positive)
Total            m - R                  R                     m


Multiple Testing – False Discovery

False Discovery Rate (FDR): The steps to use the False Discovery Rate are as follows:

1. Order the observed p-values: p(1) ≤ ... ≤ p(m)
2. Calculate an arithmetic sequence as follows:

$$\frac{i}{m}\,\alpha, \qquad i = 1, \ldots, m$$

3. Reject all hypotheses H(1), ..., H(k), where:

$$k = \max\left\{ i : p_{(i)} \leq \frac{i}{m}\,\alpha, \; 1 \leq i \leq m \right\}$$

FDR is also conservative in most cases. The effective number of markers could be used, with an FDR based on Meff instead of m (Li and Ji, 2005).
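The Benjamini and Hochberg step-up procedure can be sketched as follows. This is a minimal illustration in plain Python; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda j: pvalues[j])
    # k = largest rank i with p(i) <= (i/m) * alpha; k stays 0 if none qualifies
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvalues[j] <= rank / m * alpha:
            k = rank
    # Reject the k hypotheses with the smallest p-values
    rejected = [False] * m
    for j in order[:k]:
        rejected[j] = True
    return rejected

# Hypothetical p-values from four marker tests
decisions = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], alpha=0.05)
```

Note that this is a step-up rule: a p-value above its own threshold is still rejected if a larger p-value further down the ordered list meets its threshold.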

Multiple Testing - Permutation

Permutations (Broman and Sen, 2009).

The steps involved are:

1. Randomize (shuffle) the phenotypes relative to the marker data.
2. Perform a QTL mapping and obtain p-values (or LOD scores) and keep the most extreme value (i.e., maximum LOD or smallest p-value): Mi*
3. Repeat steps 1 and 2 several (r) times (e.g., 1,000 or 10,000).
4. Produce the empirical distribution of the extreme values: M1*, ..., Mr*.
5. Use the 95th percentile of the Mi* values as the threshold.

Permutations are computing-intensive but are precise in all cases.
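The permutation steps can be sketched as below. This is a simplified illustration, not a full QTL analysis: the squared marker-phenotype correlation is used as a stand-in for the LOD score, the function name `permutation_threshold` is hypothetical, and the genotype and phenotype data are simulated.

```python
import numpy as np

def permutation_threshold(genotypes, phenotypes, n_perm=1000, seed=0):
    """95th percentile of the genome-wide maximum statistic under permutation."""
    rng = np.random.default_rng(seed)
    g = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotypes, dtype=float)
    # Standardize markers and phenotype so g.T @ y / n gives correlations
    # (assumes no marker is monomorphic)
    g = (g - g.mean(axis=0)) / g.std(axis=0)
    y = (y - y.mean()) / y.std()
    n = y.shape[0]
    max_stats = np.empty(n_perm)
    for r in range(n_perm):
        y_perm = rng.permutation(y)       # step 1: shuffle phenotypes
        r2 = (g.T @ y_perm / n) ** 2      # step 2: one test per marker
        max_stats[r] = r2.max()           # keep the most extreme value
    # steps 4 and 5: empirical distribution and its 95th percentile
    return float(np.percentile(max_stats, 95))

# Simulated data: 50 individuals, 20 biallelic markers, no true QTL
rng = np.random.default_rng(1)
geno = rng.integers(0, 2, size=(50, 20))
pheno = rng.normal(size=50)
threshold = permutation_threshold(geno, pheno, n_perm=200)
```

Because whole phenotype vectors are shuffled, the correlation among markers is preserved in every permutation, which is why the threshold accounts for non-independent tests.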

16


Multiple Testing


SO WHAT WAS WRONG WITH THE PREVIOUS CARTOON?

A common practice is to use P < 0.05 to decide about the significance of a test. But with a large number of tests, the chance of having at least one false positive is close to 1.
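A quick calculation shows how fast this chance grows. The sketch assumes m independent tests that are all true nulls, each tested at level alpha; the function name is hypothetical.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m

# Chance of at least one false positive with 1, 10, and 100 tests at alpha = 0.05
fwer_1 = familywise_error_rate(0.05, 1)
fwer_10 = familywise_error_rate(0.05, 10)
fwer_100 = familywise_error_rate(0.05, 100)
```

With 100 independent tests the chance is already above 99%; with linked markers the tests are correlated, so the true value is somewhat lower, but the problem remains.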

LET’S LOOK AT OUR OPTIONS IN HYPOTHESIS TESTING

1. Family Control: Bonferroni multiple-testing protection
   a. P = 0.05 / number of tests (this is very conservative)
   b. P = 0.05 / effective number of tests, with the effective number of tests estimated from the marker data (Li and Ji, 2005).

2. False Discovery Rate (Benjamini and Hochberg, 1995).

3. Permutations (Broman and Sen, 2009).

18

Lecture 8: Association Mapping


Module 1

1

• Genetic mapping
• Linkage disequilibrium
• Population structure
• Complex trait dissection


2

Linkage Analysis: Family

[Figure: P1 × P2 cross giving the F1 and F2, with 1 generation of recombination. The QTL interval spans 20 cM, which could contain 200 or more genes.]

Only hundreds of markers are needed to capture the recent recombination events, but at the expense of lower resolution.

Association mapping: Natural populations

[Figure: natural population with many generations of recombination.]

Higher resolution mapping of causative loci relative to linkage analysis, but potentially thousands to millions of genetic markers need to be scored on the population.

Linkage analysis vs. Association mapping

[Figure: mapping approaches (association mapping, positional cloning, recombinant inbred lines, pedigree, intermated recombinant inbreds, F2 / BC, near-isogenic lines) compared by resolution (bp), research time (year), and allele number. Yu and Buckler 2006. Curr. Opin. Biotechnol. 17:155-160]

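Association Tests

• Evaluate whether SNPs associate with phenotype

• Natural populations

• Exploit extensive recombination

[Figure: six individuals with heights of 1.3 m, 1.5 m, 1.4 m, 1.8 m, 2.0 m, and 2.0 m, shown with their SNP haplotypes (T A G A A; C G G A A; C G T A A; T A T C G; C G T A G; T G G A G)]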

Manhattan plot: summarize association mapping results

• Identify genomic regions associated with a phenotype
• Fit a statistical model at each SNP in the genome
• Fitted models are used to test H0: No association between SNP and phenotype

Lipka



8

Walsh

Linkage vs. Linkage disequilibrium

• Linkage = excess of parental gametes from a particular parent

• Linkage disequilibrium = nonrandom distribution of linkage phases in the population

[Figure, no LD: with linkage, AB/ab parents give an excess of parental gametes AB and ab, while Ab/aB parents give an excess of parental gametes Ab and aB. When both kinds of parents are equally common, pooling all gametes to estimate LD gives AB, ab, Ab, and aB equally frequent. No LD: random distribution of linkage phases.]

[Figure, with LD: most parents are AB/ab and few are Ab/aB. Pooling all gametes to estimate LD gives an excess of AB and ab due to an excess of AB/ab parents. With LD: nonrandom distribution of linkage phase.]

Walsh

D(AB) = freq(AB) - freq(A)*freq(B)

If A and B are independent, then D = 0 (no LD). If D ≠ 0, there is a correlation between A and B in the population. The range of D is determined by the allele frequencies (undesirable; it needs to be normalized).

If a marker and QTL are linked, then the marker and QTL alleles are in LD in close relatives, generating a marker-trait association.

The decay of D: D(t) = (1-c)^t D(0), where c is the recombination rate and t is the number of random mating generations. Tightly linked genes (small c) initially in LD can retain LD for long periods of time.
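The decay formula is easy to explore numerically. The sketch below uses hypothetical function names and example values of c to show how long LD persists for tightly linked versus unlinked loci.

```python
import math

def ld_after(d0, c, t):
    """Disequilibrium after t generations of random mating: D(t) = (1-c)^t D(0)."""
    return (1.0 - c) ** t * d0

def generations_to_halve(c):
    """Number of generations for D to fall to half of its initial value."""
    return math.log(0.5) / math.log(1.0 - c)

# Unlinked loci (c = 0.5) lose half of their LD every generation
d_unlinked = ld_after(0.25, 0.5, 1)

# Tightly linked loci (c = 0.01) take dozens of generations to lose half
half_life_tight = generations_to_halve(0.01)
```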

LD: Linkage disequilibrium

Walsh12

• The maximum value of D is a function of allele frequencies
• For two biallelic loci, let p = Freq(A), q = Freq(B)
• Dmax = min[p(1-q), (1-p)q] for D ≥ 0
• Dmin = max[-pq, -(1-p)(1-q)] for D < 0

• Lewontin’s D′ (normalized D; 1964) is defined as
• D′ = D/Dmax for D ≥ 0
• D′ = D/Dmin for D < 0

• Can also scale D by expressing it as the correlation r among alleles
• r = D/sqrt[p*(1-p)*q*(1-q)]
• Under drift-mutation-recombination equilibrium, E(r^2) ~ 1/(1 + 4Nec)

Measures of LD

Walsh/Gore13

Measures of LD: D and D′

• D describes the difference between coupling and repulsion gamete frequencies

• Hedrick 1987. Genetics 117:331–341

• D captures information about allelic association and allele frequencies

• D′ is preferred because it is normalized and thus ranges between 0 and 1

• D and D′ may be highly erratic with rare alleles and small sample sizes

14

Measures of LD: r2

• r2 (0 to 1) is the squared value of Pearson’s correlation coefficient

• Hill and Robertson 1968. Theor. Appl. Genet. 38:226–231

• r2 summarizes both recombinational and mutational histories, while D and D′ measure only recombination

• r2 is preferred in association studies because it is more indicative of how markers might correlate with QTL

15

Examples of LD estimation

Gamete    Freq
AB        0.3
Ab        0.2
aB        0.4
ab        0.1

freq(A) = p = freq(AB) + freq(Ab) = 0.5

freq(B) = q = freq(AB) + freq(aB) = 0.7

freq(a) = 1 - p = 0.5

freq(b) = 1 - q = 0.3

Linkage-equilibrium value for AB = freq(A)*freq(B) = 0.35

D(AB) = freq(AB) - freq(A)*freq(B) = -0.05

Dmin = max[-pq, -(1-p)(1-q)] = max(-0.35, -0.15) = -0.15

D′ = D/Dmin = -0.05/-0.15 = 0.33

r = D/sqrt[p*(1-p)*q*(1-q)] = -0.05/sqrt(0.0525) = -0.218

r2 = D^2/[p*(1-p)*q*(1-q)] = 0.0025/0.0525 = 0.0476

Walsh/Gore
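The worked example can be checked with a short function that computes D, D′, r, and r2 from the four gamete frequencies. The function name `ld_stats` is hypothetical; the formulas follow the definitions of D, Dmax, Dmin, and r given in this lecture.

```python
from math import sqrt

def ld_stats(f_AB, f_Ab, f_aB, f_ab):
    """Return (D, D', r, r^2) from the four gamete frequencies (which sum to 1)."""
    p = f_AB + f_Ab                  # freq(A)
    q = f_AB + f_aB                  # freq(B)
    d = f_AB - p * q                 # disequilibrium coefficient
    if d >= 0:
        bound = min(p * (1 - q), (1 - p) * q)      # Dmax
    else:
        bound = max(-p * q, -(1 - p) * (1 - q))    # Dmin
    d_prime = d / bound if bound != 0 else 0.0
    r = d / sqrt(p * (1 - p) * q * (1 - q))
    return d, d_prime, r, r * r

# Gamete frequencies from the worked example
d, d_prime, r, r2 = ld_stats(0.3, 0.2, 0.4, 0.1)
```

The same function reproduces the haplotype-count examples if the counts are first divided by their total.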

16

Linkage disequilibrium

Each 2 × 2 table gives haplotype counts, with the alleles (1, 2) of Locus 1 in columns and the alleles of Locus 2 in rows.

Complete Disequilibrium (D′ = 1, r2 = 1)
    6  0
    0  6
* Complete LD between sites
* Same mutational history
* Low mapping resolution

Complete Equilibrium (D′ = 0, r2 = 0)
    3  3
    3  3
* Pattern implies recombination regardless of mutational history
* High mapping resolution

Partial Disequilibrium (D′ = 1, r2 = 0.33)
    6  3
    0  3
* Site 2 may be a relatively new mutation without recombination
* Moderate mapping resolution

Complete Equilibrium (D′ = 0, r2 = 0)
    4  4
    2  2
* Pattern implies recombination

Modified from Rafalski 2002. Curr. Opin. Plant Biol. 5:94–100
Modified from Gaut and Long 2003. Plant Cell 15:1502-1506

r2 in Association Mapping

Haplotype counts for a SNP marker (Locus 1, columns) and the causative variant (Locus 2, rows):
    5  3
    0  2
D′ = 1, r2 = 0.25

The causative variant explains 10% of the total phenotypic variance.

The SNP marker will explain 25% of the total QTL variation, but only 2.5% of the total phenotypic variation (0.25 × 10%). A large sample size is needed.

r2 > 0.80 is recommended for association studies.

19

Visualize the extent of linkage disequilibrium between pairs of loci: r2 vs. physical distance

[Figure: r2 vs. physical distance for the maize Dwarf3 (d3) gene. Remington et al. 2001. PNAS 98:11479-11484]

On average, intragenic LD rapidly declines to nominal levels (r2 < 0.1) within 2 kb in diverse maize.

The line fitted to the data minimizes the sum of the squared differences between r2 and its expected value, assuming recombination scales with physical distance.

Plot of GWAS results for α-tocopherol (αT) content in maize grain and LD (r2) across the ZmVTE4 region

Purple triangles are the r2 values of each SNP relative to the peak SNP (indicated in red).

The blue vertical lines are –log10 P-values for SNPs that are statistically significant for αT at 5% FDR, whereas the gray vertical lines are –log10 P-values for SNPs that are non-significant at 5% FDR.

Lipka et al. 2013. G3 3:1287-1299

There are many evolutionary forces that can shape levels of LD in a population

• Natural and artificial selection
• Recombination rate
• Genetic drift
• Mutation rate
• Population structure
• Population expansion/bottleneck
• Admixture
• Mating system

Slatkin 2008. Nature Reviews Genetics 9:477-485

LD range in major crops

[Figure legend: non-coding sites; synonymous sites]

Buckler and Gore 2007. Nat. Genet. 39:1056-1057

23



24

Linkage vs. Association

• The important distinction between linkage and association is subtle, yet very critical

• Marker allele M is associated with the trait if Cov(M,y) ≠ 0

• While such associations can arise via linkage, they can also arise via population structure

• Thus, association DOES NOT imply linkage, and linkage is not sufficient for association

Walsh25

Population structure: case/control samples in association mapping

When a population being sampled consists of several distinct subpopulations, marker alleles may provide information as to which group an individual belongs. If there are other risk factors in a group, this can create a false association between a marker and trait.

Example. The Gm (human immunoglobulin G; IgG) marker was thought to be an excellent candidate gene (IgG has an insulin-like effect) for non-insulin-dependent diabetes in the high-risk population of Pima Indians in the American Southwest.

Initially a highly significant negative association was observed between the Gm+ haplotype and diabetes:

Gm+        Total    % with diabetes
Present    293      8%
Absent     4,627    29%

Problem: The presence/absence of this haplotype is also a very sensitive indicator of admixture with the Caucasian population. The frequency of Gm+ is around 67% in Caucasians (lower-risk diabetes population) as compared to <1% in full-heritage Pima.

The association was re-examined in a population of Pima adults (over age 35) that were 7/8th (or more) full heritage. Now the association between the Gm+ haplotype and disease disappeared:

Gm+        Total    % with diabetes
Present    17       59%
Absent     1,764    60%

The Gm+ marker is a predictor of diabetes not because it is linked to causative gene(s) but because it predicts whether an individual is from a specific population. Admixed individuals with a significant fraction of genes of Caucasian extraction have a lower chance of carrying the causative gene(s), while gene(s) increasing the risk of diabetes are at a higher frequency in full-heritage Pima.

Walsh/Gore

FST, a measure of population structure

• One measure of population structure is given by Wright’s FST statistic (also called the fixation index)

• Essentially, this is the fraction of genetic variation due to between-population differences in allele frequencies

• Changes in allele frequencies can be caused by evolutionary forces such as genetic drift, selection, and local adaptation

• Consider a biallelic locus (A, a). If p denotes the overall population frequency of allele A,
  – then the overall population variance is p(1-p)
  – Var(pi) = variance in p over subpopulations
  – FST = Var(pi)/[p(1-p)]

Walsh/Gore

28

Example of FST estimation

Population    Freq(A)
1             0.1
2             0.6
3             0.2
4             0.7

Assume all subpopulations contribute equally to the overall metapopulation.

Overall freq(A) = p = (0.1 + 0.6 + 0.2 + 0.7)/4 = 0.4

Var(pi) = E(pi^2) - [E(pi)]^2 = E(pi^2) - p^2

Var(pi) = [(0.1^2 + 0.6^2 + 0.2^2 + 0.7^2)/4] - 0.4^2 = 0.065

Total population variance = p(1-p) = 0.4(1-0.4) = 0.24

Hence, FST = Var(pi)/[p(1-p)] = 0.065/0.24 = 0.27
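The FST calculation can be written as a small function. This sketch assumes, as in the example, that all subpopulations contribute equally; the function name `fst` is hypothetical.

```python
def fst(freqs):
    """Wright's FST for one biallelic locus from subpopulation allele frequencies.

    Assumes every subpopulation contributes equally to the metapopulation.
    """
    n = len(freqs)
    p_bar = sum(freqs) / n                                # overall freq(A)
    var_p = sum(p * p for p in freqs) / n - p_bar ** 2    # Var(pi)
    return var_p / (p_bar * (1 - p_bar))                  # Var(pi) / [p(1-p)]

# Four-population example: FST is about 0.27
fst_example = fst([0.1, 0.6, 0.2, 0.7])

# Two-population cases: identical frequencies, strong divergence, fixed differences
fst_none = fst([0.5, 0.5])
fst_strong = fst([0.9, 0.25])
fst_complete = fst([1.0, 0.0])
```

The two-population cases correspond to FST values of 0, about 0.43, and 1.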

Walsh/Gore29

Graphical example of FST (homozygous diploid individuals in two populations, P1 and P2)

• P1: p = 0.5, q = 0.5; P2: p = 0.5, q = 0.5. FST = 0. No population differentiation
• P1: p = 0.9, q = 0.1; P2: p = 0.25, q = 0.75. FST = 0.43. Strong population differentiation
• P1: p = 1, q = 0; P2: p = 0, q = 1. FST = 1. Complete population differentiation

Modified from Escalante et al. 2004. Trends Parasitol. 20:388-395

Rice population structure

[Figure: unrooted neighbor-joining tree based on C.S. Chord distance (Cavalli-Sforza and Edwards 1967) computed from 169 nuclear SSRs. The key relates the color of the line to the chloroplast haplotype based on ORF100 and PS-ID sequences. Admixed individuals are marked with *. FST = 0.25; FST = 0.43]

Garris et al. 2005. Genetics 169:1631-1638

Maize population structure

[Figure: phylogenetic tree for 260 inbred lines using the log-transformed proportion of shared alleles distance. Groups: Non-Stiff Stalk, Tropical/Subtropical, Stiff-Stalk, Teosinte. FST = 0.18; FST = 0.22]

Liu et al. 2003. Genetics 165:2117-2128

Flint-Garcia et al. 2005. Plant J. 44:1054-1064

34



35

y = Xb + Sa + Qv +Zu + e

Methodology development

Genomic technology

r2

Association mapping

p = 1e-7

Zhu et al. 2008. Plant Gen. 1:5-20

Genetic diversity

36

Construction of association mapping panels

http://shop.nativeseeds.org/products/don

http://www.oardc.ohio-state.edu/vanderknaap/caps_project.php http://b4fa.org/biosciences-and-agriculture/plantbreeding/genetic-diversity/

http://en.wikipedia.org/wiki/List_of_culinary_vegetables

37

Phenotyping association mapping panels

38

Photo from T. Rocheford

White/yellow seed color is controlled by 1 gene in maize

Yellow-orange gradient in seed color is controlled by 11 genes in maize

Quantitative resistance to northern leaf blight is controlled by at least 26 loci in maize
• Poland et al. 2011. PNAS 108:6893-6898

Flowering time is controlled by at least ~40 loci in maize
• Buckler et al. 2009. Science 325:714-718

Traits such as biomass or grain yield are expected to be substantially more complex

40




Molecular Diversity I: Genome to Genes

Species        Ploidy level / chromosome number    Genome size
Arabidopsis    2n=2x=10                            ~134 Mb
Rice           2n=2x=24                            ~420 Mb
Sorghum        2n=2x=20                            ~760 Mb
Maize          2n=2x=20                            ~2,500 Mb
Barley         2n=2x=14                            ~4,900 Mb
Cotton         2n=4x=52                            ~2,500 Mb
Oat            2n=6x=42                            ~11,300 Mb
Wheat          2n=6x=42                            ~16,500 Mb

Species        Number of genes
Arabidopsis    27,379
Rice           35,679
Sorghum        34,496
Cotton         37,505
Maize          32,540

J. Poland

http://www.gencodys.eu/Patient%20entry%20.php http://chibba.agtec.uga.edu/duplication/

42

Molecular Diversity II: Nucleotide variants in a population

Single-Nucleotide Polymorphism (SNP)

…TGAACCTAAGTATGTCCG…



…TGAACCTAGGTATGTCCG…

…TGAACCTAGGTATGTCCG…

…TGAACCTAGGTATGTCCG…A/G

SNP allele

Line 1

Line 2

Line 3

Line 4

Line 5

Line 6

43

The number of markers needed for association mapping with complete genome-wide coverage (r2 ≥ 0.8) depends on the following population, genomic, and genetic architecture parameters:

• Genome size
• Rate of LD decay
• Nucleotide diversity levels
• Causative variant effect sizes

• Arabidopsis – ~200,000 SNPs

• Grape – ~2,000,000 SNPs

• Diverse maize – ~20,000,000 SNPs

The costs of DNA sequencing have rapidly decreased

45

https://www.genome.gov

Doubling of “computer power” every two years

https://www.genome.gov

Constructing a Reference Genome Sequence

Whole-Genome Resequencing

The International HapMap Project

(a)SNPs are identified in DNA samples from multiple individuals.

(b)Adjacent SNPs that are inherited together are compiled into haplotypes.

(c) “Tag” SNPs are identified within haplotypes that uniquely describe those haplotypes

Three steps of HapMap construction

Poland and Rife 2012. Plant Gen. 5:92-102

Genotyping-by-sequencing (GBS): massive parallel sequencing of multiplex reduced-representation libraries

48

Fast Inbred Line Library ImputatioN (FILLIN) algorithm: rapidly and accurately impute missing genotypes in low-coverage, GBS-type data with ordered markers

FILLIN first tries to impute the entire site window with one (1a) or two (1b) haplotypes (using the Viterbi Hidden Markov Model [HMM] to model the recombination break point), then if that is unsuccessful tries to impute for smaller windows, first with one haplotype (2a), then two with Viterbi (2b), finally by combining two haplotypes to model heterozygosity (2c). If this does not satisfy (lower) error thresholds, the smaller window is not imputed. Dashed arrows mean that the algorithm continues if conditions are not satisfied for imputation. Beagle v. 4 is still preferable for diverse heterozygous populations.

[Figure labels: haplotype library via clustering; recombination breakpoint; smaller windows; genetic relatedness]

Swarts et al. 2014. Plant Gen. 7:1-12




Different types of samples used for association mapping

[Figure: sample types arranged along two axes. Population structure axis: depicts relationships among major subpopulations associated with local adaptation or diversifying selection. Familial relatedness axis: depicts the relationships among individuals from recent coancestry. Examples shown: maize, rice, human admixture, CEPH grandparents, CEPH Utah family.]

Yu et al. 2006. Nat. Genet. 38:203-208
Modified slide from Ed Buckler

51

• (Line1,…, Linen) ~ MVN(0, , )

• K = kinship matrix

• εi ~ i.i.d. N(0, )

Model terms: phenotype of the ith individual = grand mean + fixed effects (account for population structure) + marker effect × observed SNP alleles of the ith individual + random effects (account for familial relatedness) + random error term

The kinship matrix K measures relatedness between individuals.

Mixed linear models are used to reduce false positives in association mapping

Yu et al. 2006. Nat. Genet. 38:203-208 Lipka52

Q (population structure) + K (relatedness)

Yu et al. 2006. Nat. Genet. 38:203-208

Model Comparison

53




Huang et al. 2010. Nat. Genet. 42:961-967

Association mapping in 373 indica rice lines with nearly one million SNPs: traits with a weak correlation with population structure

qsw5

55

Association mapping in 373 indica rice lines with nearly one million SNPs: traits with a strong correlation with population structure

Huang et al. 2010. Nat. Genet. 42:961-967

Linkage analysis vs. Association mapping

Linkage analysis                                 Association mapping
Low resolution                                   High resolution
Small reference population & allele numbers      Large reference population & allele numbers
Balanced allele frequency                        Rare alleles
Known population structure                       Cryptic population structure

57

The maize Nested Association Mapping (NAM) panel

58

[Figure: NAM design. B73 is crossed to each of 25 other parents (P1, P2, ..., P25), giving 25 RIL populations (Pop1, Pop2, ..., Pop25) that together form a 5,000 RIL linkage map.]

Linkage resolution: The 5,000 RILs are genotyped with 14k GBS SNP markers for NAM joint linkage.

NAM resolution: Whole-genome resequencing of parents and imputation of 30M SNPs onto recombination blocks.

Yu et al. 2008. Genetics 178:539–551

QTL detection power: NAM panel size, heritability, and number of QTL

61

Tian et al. 2011. Nat. Genet. 43:159–162

Joint-Linkage/Association mapping of leaf traits in the maize NAM panel

62


Module 1

Lecture 9: Inbreeding and crossbreeding

1


2

• Inbreeding theory
• Heterosis theory
• Maize heterotic groups
• Genetic models of heterosis
• Genetic basis of heterotic loci
• Crossing schemes to reduce loss of heterosis

• Inbreeding = mating of related individuals

• Often results in a change in the mean of a trait

• Inbreeding is intentionally practiced to:
  – Create genetic uniformity of laboratory stocks
  – Produce stocks for crossing (animal and plant breeding)

• Inbreeding is unintentionally generated:
  – By keeping small populations (such as is found at zoos)
  – During selection within animal and plant breeding programs

Inbreeding

3

Walsh

• The inbreeding coefficient, F

• F = Prob(the two alleles within an individual are IBD), where IBD = identical by descent

• Hence, with probability F both alleles in an individual are identical, and the individual is a homozygote

• With probability 1-F, the alleles are combined at random

4

Genotype frequencies under inbreeding

Walsh

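Genotype    Alleles IBD    Alleles not IBD    Frequency
A1A1        Fp             (1-F)p^2           p^2 + Fpq
A2A1        0              (1-F)2pq           (1-F)2pq
A2A2        Fq             (1-F)q^2           q^2 + Fpq

[Figure: probability tree. With probability F the two alleles are IBD, giving A1A1 (probability p) or A2A2 (probability q). With probability 1-F the alleles combine at random (random mating), giving A1A1, A2A1, or A2A2.]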
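The genotype frequencies in the table can be sketched as a small function. The function name `genotype_freqs` and the example values of p and F are hypothetical.

```python
def genotype_freqs(p, F):
    """Genotype frequencies at a biallelic locus with inbreeding coefficient F."""
    q = 1.0 - p
    return {
        "A1A1": p * p + F * p * q,       # p^2 + Fpq
        "A2A1": (1.0 - F) * 2 * p * q,   # (1-F)2pq
        "A2A2": q * q + F * p * q,       # q^2 + Fpq
    }

# Random mating (F = 0) versus partial and complete inbreeding at p = 0.5
freqs_random = genotype_freqs(0.5, 0.0)
freqs_partial = genotype_freqs(0.5, 0.5)
freqs_complete = genotype_freqs(0.5, 1.0)
```

As F goes from 0 to 1, heterozygotes are lost and the two homozygote frequencies approach the allele frequencies p and q.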

5

Walsh

Changes in the mean under inbreeding

Using the genotypic frequencies under inbreeding, the population mean mF under a level of inbreeding F is related to the mean m0 under random mating by

mF = m0 - 2Fpqd

Walsh

• There will be a change in mean value if dominance is present (d not 0)

• For a single locus, if d > 0, inbreeding will decrease the mean value of the trait. If d < 0, inbreeding will increase the mean.

• For multiple loci, a decrease (inbreeding depression) requires directional dominance, i.e., dominance effects di tending to be positive

• The magnitude of the change of mean on inbreeding depends on allele frequency, and is greatest when p = q = 0.5
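The relationship mF = m0 - 2Fpqd can be checked directly from the genotype frequencies under inbreeding. The sketch assumes the usual single-locus genotypic values of +a, d, and -a for A1A1, A2A1, and A2A2 (this coding is an assumption, since it is not spelled out here); the function name and numbers are hypothetical.

```python
def mean_under_inbreeding(p, a, d, F):
    """Population mean with genotypic values +a (A1A1), d (A2A1), -a (A2A2)."""
    q = 1.0 - p
    f11 = p * p + F * p * q          # freq(A1A1)
    f12 = (1.0 - F) * 2 * p * q      # freq(A2A1)
    f22 = q * q + F * p * q          # freq(A2A2)
    return a * f11 + d * f12 - a * f22

# Hypothetical locus: p = 0.5, a = 1, d = 0.5
m0 = mean_under_inbreeding(0.5, 1.0, 0.5, F=0.0)   # random mating
mF = mean_under_inbreeding(0.5, 1.0, 0.5, F=0.5)   # inbred population

# The drop in the mean matches 2Fpqd
expected_drop = 2 * 0.5 * 0.5 * 0.5 * 0.5
```

With d = 0 the two means are equal, which is the statement that inbreeding depression requires dominance.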

Walsh

Inbreeding depression

[Figure: example for maize height across generations F1, F2, F3, F4, F5, F6, and F7]

Walsh

9

Fitness traits and inbreeding depression

• Inbreeding depression is often strongest on fitness-related traits such as yield, height, etc.

• Traits less associated with fitness often show less inbreeding depression

• Selection on fitness-related traits may generate directional dominance

Walsh

10

Inbreeding depression in selfing lineages?

• Inbreeding depression is common in outcrossing species

• However, it is generally fairly uncommon in species with a high rate of selfing

• One idea is that the constant selfing has purged many of the deleterious alleles thought to cause inbreeding depression

• However, lack of inbreeding depression also means a lack of heterosis

Walsh

Variance changes under inbreeding

Inbreeding reduces variation within each population

Inbreeding increases the variation between populations (i.e., variation in the means of the populations)

[Figure: distributions of lines at F = 0, F = 1/4, F = 3/4, and F = 1. Between-group variance increases with F; within-group variance decreases with F.]

Walsh

Walsh

• A series of inbred lines from an F2 population are expected to show:

  – more within-line uniformity (variance about the mean within a line)

    • Less within-family genetic variation for selection

  – more between-line divergence (variation in the mean value between lines)

    • More between-family genetic variation for selection

13

Implications for traits

Walsh

                 General        F = 1    F = 0
Between lines    2F VA          2VA      0
Within lines     (1-F) VA       0        VA
Total            (1+F) VA       2VA      VA

The above results assume ONLY additive variance, i.e., no dominance/epistasis. When non-additive variance is present, the results become very complex (see WL Chp. 3).
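The additive-only partition in the table can be sketched as a function of F and VA. The function name and the example values are hypothetical, and the same additive-only assumption applies.

```python
def variance_partition(F, VA):
    """Between-line, within-line, and total genetic variance (additive effects only)."""
    return {
        "between": 2 * F * VA,      # variance among line means
        "within": (1 - F) * VA,     # variance remaining within lines
        "total": (1 + F) * VA,
    }

# Hypothetical additive variance VA = 2 at three levels of inbreeding
part_outbred = variance_partition(0.0, 2.0)
part_half = variance_partition(0.5, 2.0)
part_inbred = variance_partition(1.0, 2.0)
```

At every F the between-line and within-line components add up to the total, so inbreeding redistributes additive variance from within lines to between lines while doubling the total at F = 1.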

14

Variance changes under inbreeding

Walsh


15


Springer and Stupar 2007. Genome Res. 17:264-275

Heterosis (or hybrid vigor) is one of the least understood universal biological phenomena that has been exploited by animal and plant breeders to increase the productivity of domesticated species

[Figure: indica and japonica rice. http://www.cmbb.arizona.edu/?page_id=15]

16

Heterosis is the increased vigor, growth, size, yield, or function of hybrid progeny over the parents that results from crossing genetically unlike organisms

(1) Mid-parent (MP) heterosis = (F1 – MP)/MP × 100 = % MPH

(2) High-parent (HP) heterosis = (F1 – HP)/HP × 100 = % HPH

(3) Absolute heterosis = F1 – MP
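The three measures can be computed from the F1 mean and the two parental means. The sketch assumes a trait where larger values are better, so the high parent is the parent with the larger mean; the function name and the example yields are hypothetical.

```python
def heterosis(f1, parent1, parent2):
    """Mid-parent (%), high-parent (%), and absolute heterosis."""
    mp = (parent1 + parent2) / 2.0     # mid-parent value
    hp = max(parent1, parent2)         # high parent (larger is assumed better)
    return {
        "mph_pct": (f1 - mp) / mp * 100.0,
        "hph_pct": (f1 - hp) / hp * 100.0,
        "absolute": f1 - mp,
    }

# Hypothetical yields: parents at 8 and 10 units, F1 hybrid at 12 units
h = heterosis(12.0, 8.0, 10.0)
```

A hybrid can show positive mid-parent heterosis while high-parent heterosis is zero or negative, which happens whenever the F1 falls between the mid-parent value and the better parent.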

17

Springer and Stupar 2007. Genome Res. 17:264-275

Maize phenotypes that show heterosis

(High-parent heterosis)

18

Yuan Longping

The average yield of hybrid rice is 7.2 t/ha, while other inbred varieties yield 5.9 t/ha. It is estimated that approximately 70 million more people annually in China can be fed by planting hybrid rice

Father of Hybrid Rice: Yuan Longping developed the first hybrid rice in China as a university professor in the 1970s

http://english.cri.cn/8706/2013/06/05/3262s768613.htm

https://www.worldfoodprize.org/en/laureates/20002009_laureates/2004_jones_and_yuan/

19

When inbred lines are crossed, the progeny show an increase in mean for characters that previously suffered a reduction from inbreeding

This increase in the mean over the average value of the parents is called hybrid vigor or heterosis

A cross is said to show heterosis if H > 0, so that the F1 mean is larger than the average of both parents

20

Line crosses: Heterosis

Walsh

Expected levels of heterosis between two lines

• Heterosis depends on dominance: If d = 0, then there is neither inbreeding depression nor heterosis. As with inbreeding depression, directional dominance (d > 0) is required for heterosis.

• H is proportional to the square of the difference in allele frequencies between parental lines from different populations. H is greatest when alleles are fixed in one population and lost in the other (so that |δpi| = 1). H = 0 if δp = 0.

• H is specific to each particular cross. H must be determined empirically, since we neither know the relevant QTL nor their allele frequencies.

21

Walsh/Gore

δpi is the allele frequency difference between two crossed lines at a locus with two alleles

In the F1, all offspring are heterozygotes. In the F2, random mating has occurred, reducing the frequency of heterozygotes.

As a result, there is a reduction of the amount of heterosis in the F2 relative to the F1.

If random mating occurs in the F2 and subsequent generations, the level of heterosis stays at the F2 level.
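A sketch of the expected heterosis under this model follows. It assumes the standard single-locus result from quantitative-genetic theory (for example, Lynch and Walsh) that F1 heterosis is the sum over loci of (δpi)^2 di with no epistasis, and that the F2 retains half of it; these formulas and the example values are supplied here as assumptions and are not written out in the text above.

```python
def heterosis_f1(delta_p, d):
    """Expected F1 heterosis: sum over loci of (delta_p_i)^2 * d_i (no epistasis)."""
    return sum(dp * dp * di for dp, di in zip(delta_p, d))

def heterosis_f2(delta_p, d):
    """Expected F2 heterosis: half of the F1 value, since heterozygosity is halved."""
    return heterosis_f1(delta_p, d) / 2.0

# Hypothetical three-locus example: allele-frequency differences and dominance effects
delta_p = [1.0, 0.5, 0.0]
d = [2.0, 4.0, 3.0]

h_f1 = heterosis_f1(delta_p, d)
h_f2 = heterosis_f2(delta_p, d)
```

The third locus contributes nothing even though it shows dominance, because the two lines do not differ in allele frequency there.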

22

Walsh

Heterosis declines in the F2

Crop         % planted as hybrids    % yield advantage    Annual added yield: %    Annual added yield: tons    Annual land savings
Maize        65                      15                   10                       55 x 10^6                   13 x 10^6 ha
Sorghum      48                      40                   19                       13 x 10^6                   9 x 10^6 ha
Sunflower    60                      50                   30                       7 x 10^6                    6 x 10^6 ha
Rice         12                      30                   4                        15 x 10^6                   6 x 10^6 ha

Crosses often show high-parent heterosis, wherein the F1 not only beats the average of the two parents (mid-parent heterosis), it exceeds the best parent.

23

Agricultural importance of heterosis

Walsh

Hybrid corn in the US

• Shull (1908) suggested the objective of corn breeders should be to find and maintain the best parental lines for crosses

• Initial problem: early inbred lines had low seed set

• Solution (Jones 1918): use a hybrid line as the seed parent, as it should show heterosis for seed set

• 1930’s - 1960’s: most corn produced by double crosses

• Since 1970’s, hybrids are mostly from single crosses

24

Walsh

A cautionary tale

• In 1970-71, the plant disease Southern Corn Leaf Blight almost destroyed the whole US corn crop

• Much larger (in terms of food energy) than the great potato blight of the 1840’s

• Cause: Corn can self-fertilize, so to make hybrids breeders either have to manually detassel the pollen structures or use genetic tricks that cause male sterility

• Almost 85% of US corn in 1970 had Texas cytoplasm Tcms, a mtDNA-encoded male sterility gene

• Tcms turned out to be hypersensitive to “race T” of the fungus Helminthosporium maydis. Billion dollar losses!

25

Walsh


26


http://thescientistgardener.blogspot.com/2010/12/maize-is-machine.html

Combination of genetics and agronomic technologies has allowed a 1% or better gain per year in grain yields

Phillips 2010. Crop Sci. 50:S-99-S-108

27

Northern Flint Modern CBD Southern Dent

Photo: http://thescientistgardener.blogspot.com/2010/12/maize-is-machine.html

ReidYellow Dent OPV

LancasterSurecrop OPV

The making of corn belt dent

28

• Corn belt dent heterotic patterns are not the result of historical or geographical influences

• Open-pollinated varieties and first cycle inbreds did not show heterotic patterns when crossed; therefore, markers would not have been helpful to identify heterotic groups

Tracy and Chandler 2006. pp 219–233

The formation of maize heterotic groups

29

• Corn belt dent heterotic patterns were created by breeders through trial and error

• In the 1940s, breeders started arbitrarily splitting the germplasm pool into groups (odd vs. even numbered lines)

• Genetic drift created initial divergence in allele frequencies, which was enhanced by selection

• Modern heterotic groups are the product of divergence from a homogeneous landrace (OPV) population

Tracy and Chandler 2006. pp 219–233 van Heerwaarden et al. 2012. PNAS 109:12420-12425

The formation of maize heterotic groups

30


31


Genetic hypotheses of heterosis

• Dominance hypothesis – masking of unfavorable recessive alleles in a heterozygote. Two or more loci are needed because the value of a heterozygote at a single locus (d ≤ a) does not exceed the value of the superior parent.

• If true, it should be possible to obtain an inbred that performs equally as well as the best hybrid

• Overdominance hypothesis – the heterozygote is superior to either homozygote. Only a single locus (d > a) is needed to achieve heterosis. Also, linkage is not needed to achieve heterosis.

• If true, it should NOT be possible to obtain an inbred that performs equally as well as the best hybrid

Bernardo 2002. Breeding for Quantitative Traits in Plants pp 243-246

32

• Pseudo-Overdominance hypothesis – repulsion phase linkage of loci that show partial or complete dominance

• The effects of two loci are difficult to separate if both are tightly linked. If we did not know that two loci comprise a single linkage block, we would incorrectly conclude that heterosis is due to overdominance.

• Pseudo-overdominance is similar to the two-locus dominance hypothesis, with the exception that repulsion phase linkage is required for pseudo-overdominance.

Genetic hypotheses of heterosis

Bernardo 2002. Breeding for Quantitative Traits in Plants pp 243-246

33

Genetic models for heterosis

Repulsion Phase Linkage – Superior A and B alleles create a superior phenotype from complementation

Allelic interactions – Heterozygosity at the B locus with two functional alleles

Complementation – Slightly deleterious homozygous a, b, c alleles

Birchler et al. 2006. PNAS 103:12957-12958

34

Epistasis appears to have a major role in the genetic basis of heterosis in an elite rice hybrid

Yu et al. 1997. PNAS 94:9226-9231

The two indica rice lines, Zhenshan 97 and Minghui 63, are parents of Shanyou 63, one of the most productive hybrids in China

Additive × Additive (AA); Additive × Dominance (AD); Dominance × Additive (DA); Dominance × Dominance (DD)


36


Hill-Robertson effect

• H-R effect: linkage between sites under selection reduces the overall effectiveness of selection for finite natural populations

• Repulsion phase linkages among favorable alleles will reduce the effectiveness of selection

• Favorable alleles have a higher chance of being in repulsion phase in the presence of low recombination

• If these favorable alleles exhibit dominance, then low recombination regions should be under high selective pressure to maintain heterozygosity

Hill and Robertson 1966. Genet. Res. 8:269-294 McMullen et al. 2009. Science 325:737-740

37

[Figure legend: within 10 cM on each side of the centromere position; the rest of the chromosome regions]

Residual heterozygosity is higher in pericentromeric regions and inversely correlated with recombination rate

Gore et al. 2009. Science 326:1115-1117

The authors hypothesized that these data support the dominance theory of heterosis and that the increase in heterozygosity near centromeres is a consequence of heterosis. This heterosis is likely the product of pseudo-overdominance, most pronounced in pericentromeric regions, where the Hill-Robertson effect is strongest.

McMullen et al. 2009. Science 325:737-740

Fine mapping of a heterotic QTL for maize grain yield

An overdominant QTL on chr. 5 was dissected by NILs into two tightly linked dominant-effect QTL in repulsion phase. This provides evidence for pseudo-overdominance.

Graham et al. 1997. Crop Sci. 37:1601-1610

Meta-QTL analysis of heterosis

Concluded that pseudo-overdominance is a major cause of heterosis in maize, with no significant epistasis. Heterotic QTL for grain yield mapped near low-recombination, pericentromeric regions (i.e., likely in repulsion phase)

Schön et al. 2010. Theor. Appl. Genet. 120:321-332


Krieger et al. 2010. Nat. Genet. 42:459-463

Overdominance: heterozygosity for tomato loss-of-function alleles of SINGLE FLOWER TRUSS (SFT), which is the genetic originator of the flowering hormone florigen, increases yield by up to 60%

Yield overdominance from SFT heterozygosity is robust, occurring in distinct genetic backgrounds and across multiple environments

Heterotic yield effects derive from a dosage-dependent suppression of growth termination mediated by SELF PRUNING (SP), an antagonist of SFT

Heterozygotes have more inflorescences

• Inbreeding theory
• Heterosis theory
• Maize heterotic groups
• Genetic models of heterosis
• Genetic basis of heterotic loci
• Crossing schemes to reduce loss of heterosis

Crossing Schemes to Reduce the Loss of Heterosis: Synthetics

Take n lines and construct an F1 population by making all pairwise crosses

Walsh

Synthetics

• Major trade-offs
– As more lines are added, the F2 loss of heterosis declines
– However, as more lines are added, the mean of the F1 also declines, as less elite lines are used
– Bottom line: For some value of n, F1 - H/n reaches a maximum value and then starts to decline with n
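The trade-off in the synthetic F2 mean, F1 - H/n, can be illustrated numerically. The sketch below is only an illustration: the heterosis value H, the best F1 mean, and the linear decline of the F1 mean as less elite lines are added are all hypothetical assumptions, not values from the lecture.

```python
# Hypothetical illustration of the synthetic trade-off: F2 mean = F1 - H/n.
# Every number here is an assumption made for the sketch.

def synthetic_f2_mean(f1_mean: float, heterosis: float, n_lines: int) -> float:
    """Expected F2 mean of a synthetic built from n_lines parental lines."""
    return f1_mean - heterosis / n_lines

def f1_mean_for(n_lines: int, best: float = 100.0, drop_per_line: float = 1.0) -> float:
    """Assumed linear decline in the F1 mean as less elite lines are added."""
    return best - drop_per_line * (n_lines - 2)

HETEROSIS = 24.0  # assumed constant H (F1 mean minus parental mean)

# F2 mean for synthetics built from 2 to 10 lines
f2_by_n = {n: synthetic_f2_mean(f1_mean_for(n), HETEROSIS, n) for n in range(2, 11)}
best_n = max(f2_by_n, key=f2_by_n.get)

for n, f2 in f2_by_n.items():
    print(f"n = {n:2d}  F2 mean = {f2:.2f}")
print("n maximizing F1 - H/n:", best_n)
```

Under these made-up numbers the F2 mean rises at first (less heterosis is lost) and then falls (the F1 mean drops), so an intermediate n is best.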

Types of crosses

• The F1 from a cross of lines A x B (typically inbreds) is called a single cross
• A three-way cross (also called a modified single cross) refers to the offspring of an A individual crossed to the F1 offspring of B x C
– Denoted A x (B x C)
• A double (or four-way) cross is (A x B) x (C x D), the offspring from crossing an A x B F1 with a C x D F1

Predicting cross performance

• While single crosses (offspring of A x B) are hard to predict, three- and four-way crosses can be predicted if we know the means for single crosses involving these parents
• The three-way cross mean is the average of the means of the two single crosses:
– mean(A x {B x C}) = [mean(A x B) + mean(A x C)]/2
• The mean of a double (or four-way) cross is the average of all the single crosses between the two pairs of parents:
– mean({A x B} x {C x D}) = [mean(AxC) + mean(AxD) + mean(BxC) + mean(BxD)]/4
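The two prediction rules for three-way and double crosses can be written as a short sketch. The single-cross means below are hypothetical values chosen only to exercise the formulas, and the helper names (`sc`, `three_way`, `double_cross`) are illustrative.

```python
from itertools import product
from statistics import mean

# Hypothetical single-cross means (e.g., yield); keys are unordered parent pairs.
single_cross = {
    frozenset("AB"): 95.0, frozenset("AC"): 102.0, frozenset("AD"): 98.0,
    frozenset("BC"): 100.0, frozenset("BD"): 104.0, frozenset("CD"): 97.0,
}

def sc(p: str, q: str) -> float:
    """Look up the mean of the single cross p x q (order does not matter)."""
    return single_cross[frozenset((p, q))]

def three_way(a: str, b: str, c: str) -> float:
    """Predicted mean of A x (B x C): average of A x B and A x C."""
    return mean([sc(a, b), sc(a, c)])

def double_cross(a: str, b: str, c: str, d: str) -> float:
    """Predicted mean of (A x B) x (C x D): average of AxC, AxD, BxC, BxD."""
    return mean(sc(p, q) for p, q in product((a, b), (c, d)))

print(three_way("A", "B", "C"))          # uses AB and AC
print(double_cross("A", "B", "C", "D"))  # uses AC, AD, BC, BD
```

Note that the double-cross prediction does not use the A x B or C x D single crosses themselves, only the four crosses between the two pairs.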

Lecture 10 Mass, Family, and Line Selection

Topics
•  Breeder’s equation
   –  Outcrossing population (h²)
   –  Clones (H²)
   –  Truncation selection, selection intensity
   –  Permanent vs. transient response
•  Family selection
   –  Different types of family selection
•  Selfing
   –  Selection while selfing
•  Line selection
   –  SSD, Pedigree, Bulk, and DH schemes
   –  Early vs. late testing

Selection
•  Basic goal is to develop elite genotypes
   –  With an outcrossed population, this is done by increasing the frequency of favorable alleles
      •  Selection on additive (A), rather than interaction (D), effects
      •  In a large population, continual improvement (response) is expected over a number of generations
   –  With inbred populations, this is done by generating a series of inbreds from a cross and picking the elite lines
      •  Selection on genotypic values, hence additive + interaction effects
      •  Further progress depends upon generating new variation through additional crosses

Outcrossed Populations
•  Improvement in an outcrossed (or open-pollinated) population is akin to what animal breeders do for improvement (called recurrent selection by plant breeders)
   –  A within-generation change (the increase in trait mean among the selected individuals) is translated into a between-generation change
•  Individuals are chosen on the basis of
   –  their phenotypic value (mass selection), or
   –  the performance of their offspring (progeny testing) or relatives (such as sib or family selection)
   –  The idea is to use such information to obtain estimates of the breeding values of individuals

Response to Selection

•  Selection can change the distribution of phenotypes, and we typically measure this by changes in the mean
   –  This is a within-generation change, namely the selection differential S = µ* - µ
•  Selection can also change the distribution of breeding values
   –  This is the response to selection, the change in the trait mean in the next generation (the between-generation change), R(t) = µ(t+1) - µ(t)

The Breeder’s Equation: Translating S into R

Recall the regression of offspring value on midparent value:

$$ y_o = \mu + h^2 \left( \frac{P_f + P_m}{2} - \mu \right) $$

Averaging over the values of the selected midparents, E[ (Pf + Pm)/2 ] = µ*. Likewise, averaging over the regression gives

$$ E[\, y_o - \mu \,] = h^2 (\mu^* - \mu) = h^2 S $$

Since E[ yo - µ ] is the change in the offspring mean, it represents the response to selection, giving:

$$ R = h^2 S $$

The Breeder’s Equation (Jay Lush)
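As a small numerical sketch of translating S into R, the snippet below computes the selection differential from a toy population and applies the breeder's equation. The phenotypes, the selected set, and h² = 0.4 are hypothetical values used only for illustration.

```python
from statistics import mean

def selection_differential(population, selected):
    """S = mean of selected parents (mu*) minus mean of the whole population (mu)."""
    return mean(selected) - mean(population)

def response(h2, S):
    """Breeder's equation: R = h^2 * S."""
    return h2 * S

# Hypothetical phenotypes; the two largest individuals are kept as parents.
phenotypes = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
selected = [13.0, 14.0]

S = selection_differential(phenotypes, selected)  # within-generation change
R = response(0.4, S)                              # expected between-generation change
print(f"S = {S:.2f}, R = {R:.2f}, expected offspring mean = {mean(phenotypes) + R:.2f}")
```

With h² = 0.4, only 40% of the within-generation change is expected to carry over to the offspring mean.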

•  Note that no matter how strong S is, if h² is small, the response is small
•  S is a measure of selection, R the actual response. One can get lots of selection but no response
•  If offspring are asexual clones of their parents, the breeder’s equation becomes
   –  R = H² S
•  If males and females are subjected to differing amounts of selection,
   –  S = (Sf + Sm)/2

Pollen control
•  Recall that S = (Sf + Sm)/2
•  An issue that arises in plant breeding is pollen control: is the pollen from plants that have also been selected?
•  This is not the case for traits (e.g., yield) scored after pollination. In this case, Sm = 0, so the response is only half that with pollen control
•  Tradeoff: with an additional generation, a number of schemes can give pollen control, and hence twice the response
   –  However, this takes twice as many generations, so the response per generation is the same
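The pollen-control argument is simple arithmetic on S = (Sf + Sm)/2, sketched below. The values of h² and S are hypothetical, and the two-generation cycle length for the pollen-control scheme is taken from the trade-off described above.

```python
def response_two_sexes(h2, S_female, S_male):
    """R = h^2 * (Sf + Sm) / 2, allowing different selection on seed and pollen parents."""
    return h2 * (S_female + S_male) / 2

h2, S = 0.3, 2.0  # hypothetical heritability and selection differential

# No pollen control: pollen comes from unselected plants, so Sm = 0.
R_no_control = response_two_sexes(h2, S, 0.0)

# Pollen control: both parents come from the selected group, so Sm = Sf = S.
R_control = response_two_sexes(h2, S, S)

# If pollen control costs one extra generation per cycle, compare per generation.
per_generation_no_control = R_no_control / 1
per_generation_control = R_control / 2

print(R_no_control, R_control)
print(per_generation_no_control, per_generation_control)
```

The per-cycle response doubles with pollen control, but the per-generation response is unchanged when each cycle takes two generations.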

Selection on clones
•  Although we have framed response in terms of an outcrossed population, we can also consider selecting the best individual clones from a large population of different clones (e.g., inbred lines)
•  R = H²S, now a function of the broad-sense heritability. Since H² ≥ h², the single-generation response using clones is at least as large as that using outcrossed individuals
•  However, the genetic variation in the next generation is significantly reduced, reducing response in subsequent generations
   –  In contrast, expect an almost continual response for several generations in an outcrossed population

The Selection Intensity, i

The selection intensity i is the selection differential expressed in terms of phenotypic standard deviations, i = S/σp

Consider two traits, one with S = 10, the other S = 5. Which trait is under stronger selection? Can’t tell, because S is a function of the phenotypic variance of the trait

In contrast, i is a scaled measure and hence allows for fair comparisons over different traits
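To make the two-trait comparison concrete, here is a minimal sketch. The phenotypic standard deviations are hypothetical (the text only gives S = 10 and S = 5), chosen to show that the trait with the smaller S can be under stronger selection.

```python
def selection_intensity_from_S(S, sigma_p):
    """i = S / sigma_p: the selection differential in phenotypic standard deviations."""
    return S / sigma_p

# Hypothetical phenotypic standard deviations for the two traits.
trait_1 = {"S": 10.0, "sigma_p": 20.0}
trait_2 = {"S": 5.0, "sigma_p": 5.0}

i_1 = selection_intensity_from_S(**trait_1)
i_2 = selection_intensity_from_S(**trait_2)
print(f"trait 1: S = {trait_1['S']}, i = {i_1}")
print(f"trait 2: S = {trait_2['S']}, i = {i_2}")
```

Here trait 2 has half the selection differential but twice the selection intensity of trait 1.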

Truncation selection
•  A common method of artificial selection is truncation selection: all individuals whose trait value is above some threshold (T) are chosen
•  Equivalent to choosing only the uppermost fraction p of the population

Truncation selection
•  The fraction p saved can be translated into an expected selection intensity (assuming the trait is normally distributed)
   –  This allows a breeder (by setting p in advance) to choose an expected value of i before selection, and hence set the expected response
   –  i = (height of a unit normal at the threshold value corresponding to p) / p
   –  R code for i: dnorm(qnorm(1-p))/p

p      0.5     0.2     0.1     0.05    0.01    0.005
i      0.798   1.400   1.755   2.063   2.665   2.892
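The R one-liner above has a direct Python counterpart using only the standard library. This is a sketch of the same calculation (unit-normal density at the truncation point divided by p); the function name is illustrative.

```python
from statistics import NormalDist

def selection_intensity(p):
    """Expected selection intensity when the top fraction p of a normal trait is saved.

    Mirrors the R expression dnorm(qnorm(1-p))/p.
    """
    unit_normal = NormalDist()                  # mean 0, standard deviation 1
    threshold = unit_normal.inv_cdf(1.0 - p)    # truncation point in SD units
    return unit_normal.pdf(threshold) / p       # density height at threshold over p

# Reproduce the table of p versus i
for p in (0.5, 0.2, 0.1, 0.05, 0.01, 0.005):
    print(f"p = {p:<6} i = {selection_intensity(p):.3f}")
```

Saving a smaller fraction increases i, but with diminishing returns: going from p = 0.5 to p = 0.05 raises i by about 1.27, while going from 0.05 to 0.005 raises it by only about 0.83.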

Selection Intensity Versions of the Breeder’s Equation

$$ R = h^2 S = h^2 \frac{S}{\sigma_P} \sigma_P = i \, h^2 \sigma_P $$

Since $h^2 \sigma_P = (\sigma_A^2 / \sigma_P^2)\, \sigma_P = \sigma_A (\sigma_A / \sigma_P) = h \, \sigma_A$,

$$ R = i \, h \, \sigma_A $$

Since h = correlation between phenotypic and breeding values, h = r_PA, so

$$ R = i \, r_{PA} \, \sigma_A $$

Response = intensity × accuracy × spread in breeding values (σA)

When we select an individual solely on its phenotype, the accuracy (correlation) between BV and phenotype is h
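A quick numerical check confirms that the two forms of the equation agree. The variance components and the saved fraction p below are hypothetical values chosen for the sketch.

```python
import math
from statistics import NormalDist

def selection_intensity(p):
    """i for truncation selection saving the top fraction p (normal trait assumed)."""
    z = NormalDist().inv_cdf(1.0 - p)
    return NormalDist().pdf(z) / p

# Hypothetical variance components
var_A, var_P = 4.0, 16.0
h2 = var_A / var_P            # narrow-sense heritability
h = math.sqrt(h2)             # accuracy of phenotypic (mass) selection
sigma_A = math.sqrt(var_A)
sigma_P = math.sqrt(var_P)

i = selection_intensity(0.10)  # save the top 10%
S = i * sigma_P                # implied selection differential

R_from_S = h2 * S                   # R = h^2 S
R_from_accuracy = i * h * sigma_A   # R = i * accuracy * sigma_A
print(R_from_S, R_from_accuracy)
```

Both expressions give the same response, as the algebra requires; the second form makes it explicit that raising the accuracy is one of the three ways to raise R.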

Accuracy of selection

More generally, we can express the breeder’s equation as

$$ R = i \, r_{uA} \, \sigma_A $$

Here we select individuals based on the index u (for example, the mean of n of their sibs).

r_uA = the accuracy of using the measure u to predict an individual's breeding value = the correlation between u and an individual's BV, A

Improving accuracy
•  Predicting either the breeding or genotypic value from a single individual often has low accuracy: h² and/or H² (based on a single individual) is small
   –  Especially true for many plant traits with high G x E
   –  Need to replicate either clones or relatives (such as sibs) over regions and years to reduce the impact of G x E
   –  Likewise, information from a set of relatives can give much higher accuracy than the measurement of a single individual

Stratified mass selection
•  In order to accommodate the high environmental variance of individual plant values, Gardner (1961) proposed the method of stratified mass selection
   –  The population is stratified into a number of different blocks (i.e., sections within a field)
   –  The best fraction p within each block is chosen
   –  The idea is that environmental values are more similar among individuals within each block, increasing trait heritability
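The within-block rule can be sketched as follows. The block names, plant identifiers, and phenotypes are hypothetical, and rounding the number kept per block up to at least one plant is a choice made for this sketch rather than part of Gardner's method.

```python
import math

def stratified_mass_selection(blocks, p):
    """Select the best fraction p of plants within each block.

    blocks: dict mapping block id -> list of (plant_id, phenotype) tuples.
    Returns the list of selected plant ids.
    """
    selected = []
    for block_id, plants in blocks.items():
        n_keep = max(1, math.ceil(p * len(plants)))   # sketch assumption: keep >= 1
        ranked = sorted(plants, key=lambda plant: plant[1], reverse=True)
        selected.extend(plant_id for plant_id, _ in ranked[:n_keep])
    return selected

# Hypothetical field: the "wet" block inflates every phenotype environmentally.
field = {
    "wet": [("w1", 14.0), ("w2", 13.5), ("w3", 13.0), ("w4", 12.5)],
    "dry": [("d1", 9.0), ("d2", 8.0), ("d3", 10.5), ("d4", 7.5)],
}

chosen = stratified_mass_selection(field, p=0.25)
print(chosen)
```

Unstratified mass selection on the same data would take both of its top plants from the "wet" block, rewarding the environment rather than the genotype.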

Family selection
•  Low heritabilities of traits are a major issue
   –  High G x E, esp. year-to-year
   –  Single phenotypes are very poor predictors of A, G
   –  Hence, relatives are often grown out in field trials (multiple plots over a range of regions and years, giving better sampling of G x E)
•  Within- versus between-family selection
•  Response under general family selection
•  Lush’s family index


Different types of family-based selection

Uppermost fraction p chosen, m families each with n sibs, total N = mn


Within- vs. Between-family selection

Between-family response

Within-family response

Within, Between or Individual selection?

Which scheme is best depends on the trait heritability and the intraclass correlation t among sibs

Between-family response > individual when

Low heritability, small common-family variance

Within-family response > individual when

Requires low heritability and a large c², where c²Var(z) is the between-family common variance, which thus accounts for much of the total trait variance

General response
•  The basic idea is that a parent P generates a set of relatives x1 through xn whose phenotypes are measured (the selection unit or group)
•  Based on the performance of the selection group, a relative Ri of each of the best parents Pi is chosen to represent Pi in forming the next generation (the recombination unit or group)
   –  R can be
      •  The parent itself
      •  Progeny (measured or unmeasured) of the parent; these could be half- or full-sibs to the selection group, or the seed from selfing the parent
      •  Other relatives of P

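[Diagram: parents P1 and P2 each generate selection-unit relatives (x1, x2); the performance of the x values is used to choose parents; the corresponding recombination-unit relatives R1 and R2 are crossed to make offspring y, whose mean gives the response in the next generation]

Key: The covariance σ(xi,y) between a member of the selection group and the offspring is critical to predicting selection response. It is closely related to σ(xi,ARi), the covariance between the selection unit and the breeding value of R

Expected response is the average breeding value of the Ri.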

Response under general family selection

•  Recall the accuracy version of the breeder’s equation, R = i r_uA σA
•  In our context,
   –  i is the selection intensity between selection units
   –  u is the value of the selection unit
   –  A is the breeding value of the corresponding member R of the recombination unit, and σ²A is the additive variance in the recombination unit
•  The correlation r_uA is obtained from standard resemblance-between-relatives calculations (full details in WL Chapter 19)

Offspring - selection unit covariances

Offspring - selection unit covariances (cont)

Variance in the selection unit

Response

Specific schemes: ear-to-row
•  A common scheme in corn breeding is to plant the seeds from an ear as rows
   –  Each row is thus a half-sib family (this is the selection unit)
   –  Some seed from each ear is saved (these form the recombination unit)
   –  Suppose a total of N seeds per ear are grown as np rows of ns sibs in ne environments

Family x E interaction

Modified ear-to-row

•  Lonnquist (1964) proposed modified ear-to-row
•  Combines ear-to-row (between-family selection) with selection within rows (within-family selection)
•  Plant the seeds from each ear into two sets of rows. One set is several rows over multiple environments. Select the best ears based on this performance
•  Grow out residual seed from these best ears in a single row, then select the best plants from each family within each row

Response

Total R = R from ear-to-row + R from within-row

Lush’s family index

Finally, Lush suggested that an index weighting both within- and between-family values is optimal

The optimal weights are given by

Ratio of response/individual selection > 1

[Figure, half-sibs: ratio of index weights b2/b1 (y-axis) versus the intraclass correlation t (x-axis) for family sizes n = 2, 4, 10, 25, 50; regions labelled "More emphasis on within-family deviations" and "More emphasis on between-family deviations"]

[Figure, full-sibs: ratio of index weights b2/b1 (y-axis) versus the intraclass correlation t (x-axis) for family sizes n = 2, 4, 10, 25, 50; regions labelled "More emphasis on within-family deviations" and "More emphasis on between-family deviations"]

Permanent Versus Transient Response

Considering epistasis and shared environmental values, the single-generation response follows from the midparent-offspring regression

Permanent component of response

Transient component of response: contributes to short-term response, but decays away to zero over the long term

Permanent Versus Transient Response

The reason for the focus on h²S is that this component is permanent in a random-mating population, while the other components are transient: they initially contribute to the response, but this contribution decays away under random mating

Why? Under Hardy-Weinberg, changes in allele frequencies are permanent (they don’t decay under random mating), while the linkage disequilibrium underlying the epistatic contribution does decay, and environmental values also become randomized

Response with Epistasis

c is the average (pairwise) recombination frequency between loci involved in A x A

Response with Epistasis (cont)

Contribution to response from epistasis decays to zero as linkage disequilibrium decays to zero

Response from additive effects (h² S) is due to changes in allele frequencies and hence is permanent. The contribution from A x A is due to linkage disequilibrium
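The decay of the epistatic contribution can be sketched numerically. The slide's own equations did not survive extraction, so the formula used here is an assumption: the standard textbook result for one generation of selection from an unselected base population followed by τ generations of random mating, R(1+τ) = S[h² + (1-c)^τ σ²AA/(2σ²z)]. All parameter values are hypothetical.

```python
def response_after_relaxation(S, h2, var_AA, var_z, c, tau):
    """Response remaining tau generations after one generation of selection.

    ASSUMED formula (not recovered from the slide itself):
        R(1 + tau) = S * (h2 + (1 - c)**tau * var_AA / (2 * var_z))
    The h2 term is permanent; the A x A term decays as LD decays.
    """
    return S * (h2 + (1.0 - c) ** tau * var_AA / (2.0 * var_z))

# Hypothetical values: S = 10, h2 = 0.3, additive-by-additive variance 2,
# phenotypic variance 10, unlinked loci (c = 0.5).
params = dict(S=10.0, h2=0.3, var_AA=2.0, var_z=10.0, c=0.5)

trajectory = [response_after_relaxation(tau=tau, **params) for tau in range(8)]
permanent = params["S"] * params["h2"]

for tau, r in enumerate(trajectory):
    print(f"tau = {tau}: remaining response = {r:.4f}")
print("permanent component h2*S =", permanent)
```

With unlinked loci the transient part halves each generation; with tighter linkage (smaller c) it would persist longer.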

Why unselected base population? If there is a history of previous selection, linkage disequilibrium may be present and the mean can change as the disequilibrium decays

More generally, for t generations of selection followed by τ generations of no selection (but recombination)

RAA has a limiting value given by

Time to equilibrium is a function of c

What about response with higher-order epistasis?

Fixed incremental difference that decays when selection stops

Response in autotetraploids

•  Autotetraploids pass along two alleles at each locus to their offspring
•  Hence, dominance variance is passed along
•  However, as with A x A, this depends upon favorable combinations of alleles, and these are randomized over time by transmission, so the D component of response is transient

Autotetraploids

P-O covariance

Single-generation response

Response to t generations of selection with constant selection differential S

Response remaining after t generations of selection followed by τ generations of random mating

Contribution from dominance quickly decays to zero

General responses
•  As we have seen with both individual and family selection, the response can be thought of as a regression of either the offspring value (y) or the breeding value RA of the individual in the recombination group on some phenotypic measurement (such as the individual itself or its corresponding selection unit value x)
•  The regression slope for predicting y from x is σ(x,y)/σ²(x), and σ(x,RA)/σ²(x) for predicting the BV RA from x
•  With transient components of response, these covariances now also become functions of time, e.g., the covariance between x in one generation and y several generations later

Ancestral Regressions

When regressions on relatives are linear, we can think of the response as the sum over all previous contributions

For example, consider the response after 3 generations:

•  8 great-grandparents: S0 is their selection differential; β3,0 is the regression coefficient for an offspring at time 3 on a great-grandparent from time 0
•  4 grandparents: selection differential S1; β3,1 is the regression of a relative in generation 3 on their generation 1 relatives
•  2 parents

Ancestral Regressions

$$ \beta_{T,t} = \frac{\mathrm{cov}(z_T, z_t)}{\sigma^2(z_t)} $$

More generally,

The general expression cov(zT, zt), where we keep track of the actual generations, as opposed to cov(z, zT-t), which only tracks how many generations separate the relatives, allows us to handle inbreeding, where the (say) P-O regression slope changes over generations of inbreeding.
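The three-generation example suggests summing, over each ancestral generation t, the number of ancestors (2^(T-t)) times the regression coefficient times that generation's selection differential. The sketch below assumes that reading, and checks it against the purely additive, random-mating case, where the regression of a descendant on a single ancestor T-t generations back is taken to be h²/2^(T-t). The selection differentials and h² are hypothetical.

```python
def ancestral_response(T, S, beta):
    """Cumulative response at generation T as a sum of ancestral contributions.

    Assumed reading of the example: R(T) = sum over t < T of
        2**(T - t) * beta(T, t) * S[t]
    where 2**(T - t) is the number of ancestors in generation t.
    """
    return sum(2 ** (T - t) * beta(T, t) * S[t] for t in range(T))

h2 = 0.4  # hypothetical heritability

def additive_beta(T, t):
    """Additive-only, random-mating regression on one ancestor from generation t."""
    return h2 / 2 ** (T - t)

S = [2.0, 1.5, 1.0]  # hypothetical selection differentials S0, S1, S2

R3 = ancestral_response(3, S, additive_beta)
print(R3, h2 * sum(S))
```

In this special case the ancestor counts and the halving of the regression cancel, and the sum collapses to repeated use of the breeder's equation, h²(S0 + S1 + S2). With transient components or inbreeding, `beta` would no longer depend only on T - t.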

Selfing
•  Finally, let’s consider selection when an F1 is formed and then selfed for several generations to generate inbred lines
•  How best to advance lines to full inbreeding while selecting?
   –  Should advancement occur first, and then we can choose among the resulting inbred lines?
   –  Or should some selection (early testing) occur while the lines are being inbred?
•  First, we will consider the response with simultaneous selfing and selection
   –  Inbreeding removes within-line variation and enhances between-line variation
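The redistribution of variance under selfing can be sketched with standard results that are assumed here rather than taken from the slide: with additive gene action only and a non-inbred base (e.g., the F2), t generations of selfing give F = 1 - (1/2)^t, a between-line additive variance of 2Fσ²A, and a within-line additive variance of (1 - F)σ²A. The value of σ²A is hypothetical.

```python
def inbreeding_after_selfing(t):
    """Inbreeding coefficient after t generations of selfing from a non-inbred base."""
    return 1.0 - 0.5 ** t

def additive_variance_partition(var_A, t):
    """(between-line, within-line) additive variance, ASSUMING additive effects only."""
    F = inbreeding_after_selfing(t)
    return 2.0 * F * var_A, (1.0 - F) * var_A

var_A = 1.0  # hypothetical additive variance in the base population

for t in range(0, 7):
    between, within = additive_variance_partition(var_A, t)
    print(f"selfing generations = {t}: F = {inbreeding_after_selfing(t):.3f}, "
          f"between = {between:.3f}, within = {within:.3f}")
```

As F approaches 1, all remaining additive variance is between lines, which is why selection among finished inbreds acts on a larger between-line variance than early-generation selection does.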


Details for computing these genetic covariances are given in Walsh & Lynch Chapter 20 (online)

Selection of the best pure lines
•  A very common setting in plant breeding is when two (or more) inbreds are crossed and the resulting F1 is continually selfed to form a series of inbred lines
•  This is different from selecting elite lines among a set of already inbred lines, as the breeder also has to advance the lines to full inbreds, which often takes time, in addition to trying to select the best ones
   –  The most accurate measures of performance (given G x E) are multiregional trials, wherein a subset of the advancing lines is measured over a series of regions and years

Advancing to full inbreds
•  How best to combine advancing a line to being fully inbred while still selecting (testing) it?
   –  Tradeoff between less accurate testing of a large number of lines (but more variation kept) vs. more accurate testing of a smaller number of lines (representing less variation)
•  Methods
   –  Single seed descent (SSD)
   –  Doubled haploids (DH)
   –  Bulk selection
   –  Pedigree selection

SSD, DH selection
•  Under SSD (single-seed descent), single seeds are used to advance a series of lines to full inbreeding, then selection (choosing among them) occurs, usually through multiregional trials
•  Single seeds are used to reduce any effects of selection during inbreeding
•  Under DH (doubled haploids), inbred lines are formed in one generation
   –  Less chance for decay of any LD relative to SSD, but the effect is likely to be small

Bulk Selection

•  Seeds from natural selfers are grown and harvested in bulk over multiple generations
   –  One problem is that natural selection during the advancing of generations does select on yield (leaving more descendants), but also on other traits
   –  Often tall plants are naturally selected during the bulk over higher-yielding short plants


Bulk selection response

Atlas wins, but Vaughn best from an agricultural standpoint (higher yield -- 107% of Atlas, earlier heading date, better disease resistance)

Pedigree selection
•  Not to be confused with pedigree-based selection using BLUP (which very formally uses information from all relatives in selection decisions)
•  Under pedigree selection (aka pedigree breeding), as individuals become more inbred, selection decisions shift from individual plants towards family-based performance
•  High-heritability traits are selected early (individual selection), lower-heritability traits (e.g., those with high G x E) are selected later (family selection allowing for replication over different environments)

[Figure: pedigree selection scheme, with panels labelled "Individual selection" and "Family-based selection"]

Early generation testing
•  Much debate on the effectiveness of early generation testing
•  Effectiveness depends on a high correlation between phenotypes in the tested generations and the final genotypes of their selfed descendants
   –  Concern: genotypes selected early can be different from their fully inbred descendants
   –  Basic idea: OK for high-heritability traits
   –  Not so good for low-heritability traits

Effectiveness of different methods

EGYT = early generation yield testing, BS = bulk selection, DH = doubled haploids, SSD = single seed descent, PS = pedigree selection
