SYLLABUS INTRODUCTION TO PLANT QUANTITATIVE GENETICS
Tucson, 8 – 10 Jan. 2018
INSTRUCTORS:
Mike Gore, Cornell University, [email protected]
Lucia Gutierrez, Department of Agronomy, University of Wisconsin, Madison, [email protected]
Bruce Walsh, Department of Ecology & Evolutionary Biology, University of Arizona, [email protected]
References
B = Bernardo: Breeding for Quantitative Traits in Plants, 2nd ed.
LW = Lynch & Walsh: Genetics and Analysis of Quantitative Traits (book)
WL = Walsh & Lynch: Evolution and Selection of Quantitative Traits (website) http://nitro.biosci.arizona.edu/zbook/NewVolume_2/newvol2.html
LECTURE SCHEDULE

Monday, 8 Jan
8:30 – 10:00 am: 1. Introduction to Modern Plant Breeding (Gore, Gutierrez, Walsh)
  Background reading: B Chapter 1
10:00 – 10:30 am: Break
10:30 am – 12:00 pm: 2. Basic Genetics (Walsh, Gore)
  Background reading: LW Chapter 4
12:00 – 1:30 pm: Lunch
1:30 – 3:00 pm: 3. Basic Statistics (Walsh)
  Background reading: LW Chapters 2, 3
  Additional reading: LW Appendix A4
3:00 – 3:30 pm: Break
3:30 – 5:00 pm: 4. Allelic Effects and Genetic Variances (Walsh)
  Background reading: B Chapters 3, 6
  Additional reading: LW Chapters 4, 5

Tuesday, 9 Jan
8:30 – 10:00 am: 5. Resemblance Between Relatives (Walsh)
  Background reading: B Chapter 6
  Additional reading: LW Chapter 7
10:00 – 10:30 am: Break
10:30 am – 12:00 pm: 6. Heritability and Field Designs (Gutierrez)
  Background reading: B Chapters 6, 7
  Additional reading: LW Chapters 17, 18, 20, 22
  Holland, J., W.E. Nyquist, C.T. Cervantes-Martinez. 2010. Estimating and Interpreting Heritability for Plant Breeding: An Update. Plant Breeding Reviews 22: 9-112.
12:00 – 1:30 pm: Lunch
1:30 – 3:00 pm: 7. QTL Mapping (Gutierrez)
  Background reading: B Chapter 5
  Additional reading: LW Chapters 12-15
3:00 – 3:30 pm: Break
3:30 – 5:00 pm: 8. Association Mapping (Gore)
  Background reading: B Chapter 5.4
  Additional reading: LW Chapter 16

Wednesday, 10 Jan
8:30 – 10:00 am: 9. Inbreeding, Heterosis (Gore)
  Background reading: B Chapter 12
  Additional reading: LW Chapter 10
10:00 – 10:30 am: Break
10:30 am – 12:00 pm: 10. Mass and Family Selection (Walsh)
  Background reading: B Chapters 9, 10
  Additional reading: WL Chapters 12, 13, 19, 20, 35
ADDITIONAL BOOKS ON QUANTITATIVE GENETICS
General
Falconer, D. S. and T. F. C. Mackay. Introduction to Quantitative Genetics, 4th Edition.
Lynch, M. and B. Walsh. 1998. Genetics and Analysis of Quantitative Traits. Sinauer.
Mather, K., and J. L. Jinks. 1982. Biometrical Genetics. (3rd Ed.) Chapman & Hall.
Plant Breeding
Wricke, G., and W. E. Weber. 1986. Quantitative Genetics and Selection in Plant Breeding. De Gruyter.
Mayo, O. 1987. The Theory of Plant Breeding. Oxford.
Stoskopf, N. C., D. T. Tomes, and B. R. Christie. 1993. Plant Breeding: Theory and Practice. Westview, Boulder.
Sleper, D. A., and J. M. Poehlman. 2006. Breeding Field Crops. 5th Edition. Blackwell
Bernardo, R. 2010. Breeding for Quantitative Traits in Plants, 2nd Ed. Stemma Press.
Hallauer, A. R., M. J. Carena, and J. B. Miranda Filho. 2010. Quantitative Genetics in Maize Breeding. Iowa State Press.
Statistical and Technical Issues
Bulmer, M. 1980. The Mathematical Theory of Quantitative Genetics. Clarendon Press.
Kempthorne, O. 1969. An Introduction to Genetic Statistics. Iowa State University Press.
Sorensen, D., and D. Gianola. 2002. Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer.
Saxton, A. M. (Ed). 2004. Genetic Analysis of Complex Traits Using SAS. SAS Press.
Wu, R., C.-X. Ma, and G. Casella. 2007. Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. Springer, N.Y.
1
Lecture 1: Introduction to Modern Plant Breeding
Bruce Walsh Notes
Introduction to Plant Quantitative Genetics
Tucson. 8-10 Jan 2018
2
Importance of Plant breeding
• Plant breeding is the most important technology developed by man. It allowed civilization to form, and its continued success is critical to maintaining our way of life
• Problem: Feeding 9 billion (+) people with the same (or fewer) inputs
  – Same or less acreage
  – Same or less fertilizer, pesticides, water
  – Adapting to climate and environmental change
3
Goals of Plant breeding
• Increase the frequency of favorable alleles within a line
  – Additive effects
• Increase the frequency of favorable genotypes within a line
  – Dominance and interaction effects
• Better adapt crops to specific environments
  – Region-specific cultivars (high location G x E)
  – Stability across years within a region (low year-to-year G x E)
4
Objectives
• Development of pure (i.e. highly inbred) lines with high per se performance
• Development of pure lines with high hybrid performance (either with each other or with a testcross)
• Less emphasis on developing outbred (random-mating) populations with improved performance
• Development of lines with high regional G x E, low year G x E
5
Animal and tree breeding
• Similar goals, but since mostly outcrossing, the goal is to create high-performing populations, not inbred lines
• Generally speaking, inbreeding is bad in animals and many trees
• Focus on finding those parents with the best transmitting abilities (highest breeding values)
• Less of a G x E focus with animals, less of a focus on line and hybrid breeding
6
Special features exploited by plant breeders
• Selfing allows for the capture of specific genotypes, and hence the capture of interactions between alleles and loci (dominance and epistasis)
  – Homozygous for selfed lines
  – Heterozygous for crossed lines
• Often high reproductive output (relative to animal breeding)
• Seeds allow for multigeneration progeny testing, wherein individuals are chosen on the performance of their progeny, or of their sibs
  – Allows for better control over G x E by testing over multiple sites/years
7
Historical plant breeding
• Early origins
  – Creation of new lines through species crosses (allopolyploids)
  – Visual selection
  – Early domestication (selection for specific traits for ease of harvesting)
• Biometrical school
  – Using crosses to predict average performance under inbreeding or crossing or response to selection
  – Better management of G x E
8
Modern tools
• Molecular markers
  – Initially low density for QTL mapping, introgression of major genes into elite germplasm
  – With high-density markers, association mapping and MAS/genomic selection
• New statistical tools
  – Mixed model methods
  – Bayesian approaches to handle high-dimensional data sets
  – New methods to deal with G x E
• Other technologies
  – Better standardization of field sites (laser-tilled fields, GPS, better micro- and macro-environmental measurements)
  – High throughput phenotypic scoring
  – DH lines
9
Diversity
• Plant breeders face the conundrum of using inbred lines to concentrate elite genotypes, but requiring a very large collection of such lines to store variation for further selection
• Landraces or local cultivars may be highly adapted to specific environments, but otherwise not elite
• Issue with keeping germplasm elite while introgressing genes/regions of interest.
10
Integrated Approaches
• How do we best combine the rich history of quantitative genetics and classical plant breeding with the new tools from genomics and other advances?
• Key: Quantitative genetics has all of the machinery needed to fully incorporate these new sources of information
• The goal of this course is to show how this is done.
1
Lecture 2 Basic Plant Genetics
Bruce Walsh Notes Introduction to Plant Quantitative Genetics
Tucson. 8-10 Jan 2018
2
Overview
• Ploidy
• Linkage
• Linkage disequilibrium (LD)
• Genetic markers
• Mapping functions
• Organelle inheritance
• Mating systems and types of crosses
• Gene actions
  – Dominance and Epistasis
  – Pleiotropy
3
Ploidy
• Most animals are diploid (2n), with their gametes (eggs, sperm) containing a haploid set of n chromosomes
• Polyploids are much more common in plants.
• Allopolyploids consist of haploid sets from two (or more) species
  – e.g., an allotetraploid is AABB
  – An allohexaploid is AABBCC
  – Generally speaking, allopolyploids largely behave as diploids, i.e., each pollen/egg gets a haploid set from each of the founding species
• Autopolyploids have multiple haploid sets from the same species
  – Autotetraploids (4n) and autohexaploids (6n)
  – These give pollen and eggs with two (2n) and three (3n) (respectively) copies of each homologous chromosome
4
5
Linkage
• Independent assortment for unlinked genes
• Linkage
• Computing expected genotypic frequencies from linkage
6
Dealing with two (or more) genes
For his 7 traits, Mendel observed Independent Assortment
The genotype at one locus is independent of the second
RR, Rr - round seeds, rr - wrinkled seeds
Pure round, green (RRgg) x pure wrinkled yellow (rrYY)
F1 --> RrYg = round, yellow
What about the F2?
7
Let R- denote RR and Rr. R- are round. Note in F2, Pr(R-) = 1/2 + 1/4 = 3/4
Likewise, Y- are YY or Yg, and are yellow
Phenotype Genotype Frequency
Yellow, round Y-R- (3/4)*(3/4) = 9/16
Yellow, wrinkled Y-rr (3/4)*(1/4) = 3/16
Green, round ggR- (1/4)*(3/4) = 3/16
Green, wrinkled ggrr (1/4)*(1/4) = 1/16
Or a 9:3:3:1 ratio
8
Mendel was wrong: Linkage
Bateson and Punnett looked at
  flower color: P (purple) dominant over p (red)
  pollen shape: L (long) dominant over l (round)

Phenotype       Genotype   Observed   Expected (under independent assortment)
Purple long     P-L-       284        215
Purple round    P-ll       21         71
Red long        ppL-       21         71
Red round       ppll       55         24

Excess of PL, pl gametes over Pl, pL
Departure from independent assortment
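The size of this departure can be checked with a chi-square goodness-of-fit test against the 9:3:3:1 expectation. The snippet below is a minimal sketch, not part of the original notes: the variable names are illustrative, the expected counts are recomputed from the ratio rather than taken from the rounded table, and 7.815 is the standard 5% critical value for 3 degrees of freedom.

```python
# Observed F2 counts from the Bateson-Punnett table
observed = {"P-L-": 284, "P-ll": 21, "ppL-": 21, "ppll": 55}

# Expected proportions under independent assortment (9:3:3:1)
ratio = {"P-L-": 9, "P-ll": 3, "ppL-": 3, "ppll": 1}

n_total = sum(observed.values())
expected = {cls: n_total * r / 16 for cls, r in ratio.items()}

# Pearson chi-square statistic with 4 - 1 = 3 degrees of freedom
chi_square = sum(
    (observed[cls] - expected[cls]) ** 2 / expected[cls] for cls in observed
)

CRITICAL_5PCT_3DF = 7.815  # standard tabulated value, chi-square with 3 df
rejects_independence = chi_square > CRITICAL_5PCT_3DF

print(f"n = {n_total}, chi-square = {chi_square:.1f}, reject = {rejects_independence}")
```

The statistic is far above the critical value, which matches the visual impression of an excess of the two parental classes.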
9
Linkage
If genes are located on different chromosomes they (with very few exceptions) show independent assortment.
Indeed, peas have only 7 pairs of chromosomes, so was Mendel lucky in choosing seven traits at random that happen to all be on different chromosomes?
However, genes on the same chromosome, especially if they are close to each other, tend to be passed onto their offspring in the same configuration as on the parental chromosomes.
10
Consider the Bateson-Punnett pea data
Let PL / pl denote that in the parent, one chromosome carries the P and L alleles (at the flower color and pollen shape loci, respectively), while the other chromosome carries the p and l alleles.
Unless there is a recombination event, one of the two parental chromosome types (PL or pl) is passed onto the offspring. These are called the parental gametes.
However, if a recombination event occurs, a PL/pl parent can generate Pl and pL recombinant chromosomes to pass onto its offspring.
11
Let c denote the recombination frequency --- the probability that a randomly-chosen gamete from the parent is of the recombinant type (i.e., it is not a parental gamete).
For a PL/pl parent, the gamete frequencies are
Gamete type Frequency Expectation under independent assortment
PL (1-c)/2 1/4
pl (1-c)/2 1/4
pL c/2 1/4
Pl c/2 1/4
12
Parental gametes in excess, as (1-c)/2 > 1/4 for c < 1/2
Recombinant gametes in deficiency, as c/2 < 1/4 for c < 1/2
13
Expected genotype frequencies under linkage
Suppose we cross PL/pl X PL/pl parents
What are the expected frequencies in their offspring?
Pr(PPLL) = Pr(PL|father)*Pr(PL|mother) = [(1-c)/2]*[(1-c)/2] = (1-c)^2/4
Likewise, Pr(ppll) = (1-c)^2/4
Recall from the previous data that freq(ppll) = 55/381 = 0.144
Hence, (1-c)^2/4 = 0.144, or c = 0.24
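Solving (1-c)^2/4 = freq(ppll) for c gives c = 1 - 2*sqrt(freq(ppll)). A minimal sketch of that calculation follows; the function name is hypothetical, and it assumes both parents are in coupling phase (PL/pl), as on this slide.

```python
import math


def c_from_double_recessive(n_double_recessive, n_total):
    """Estimate c from the ppll frequency in a PL/pl x PL/pl cross.

    Uses Pr(ppll) = (1 - c)^2 / 4, so c = 1 - 2 * sqrt(freq).
    """
    freq = n_double_recessive / n_total
    return 1.0 - 2.0 * math.sqrt(freq)


c_hat = c_from_double_recessive(55, 381)
print(f"estimated c = {c_hat:.2f}")
```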
14
A (slightly) more complicated case
Again, assume the parents are both PL/pl. Compute Pr(PpLl)
Two situations, as PpLl could be PL/pl or Pl/pL
Pr(PL/pl) = Pr(PL|dad)*Pr(pl|mom) + Pr(PL|mom)*Pr(pl|dad) = [(1-c)/2]*[(1-c)/2] + [(1-c)/2]*[(1-c)/2] = (1-c)^2/2
Pr(Pl/pL) = Pr(Pl|dad)*Pr(pL|mom) + Pr(Pl|mom)*Pr(pL|dad) = (c/2)*(c/2) + (c/2)*(c/2) = c^2/2
Thus, Pr(PpLl) = (1-c)^2/2 + c^2/2
15
Generally, to compute the expected genotype probabilities, need to consider the frequencies of gametes produced by both parents.
Suppose dad = Pl/pL, mom = PL/pl
Pr(PPLL) = Pr(PL|dad)*Pr(PL|mom) = [c/2]*[(1-c)/2]
Notation: when PL/pl, we say that alleles P and L are in coupling
When parent is Pl/pL, we say that P and L are in repulsion
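The bookkeeping in these genotype calculations can be sketched in a few lines: first tabulate each parent's gamete frequencies from its linkage phase, then multiply across parents. The helper names below are hypothetical, haplotypes are written as two-character strings, and c = 0.24 is the estimate from the Bateson-Punnett data.

```python
def gamete_freqs(hap1, hap2, c):
    """Gamete frequencies for a double heterozygote with phase hap1/hap2.

    hap1 and hap2 are two-character strings such as "PL" and "pl".
    Parental gametes have frequency (1-c)/2 each, recombinants c/2 each.
    """
    rec1 = hap1[0] + hap2[1]
    rec2 = hap2[0] + hap1[1]
    return {hap1: (1 - c) / 2, hap2: (1 - c) / 2, rec1: c / 2, rec2: c / 2}


def pr_offspring(gam_a, gam_b, dad, mom):
    """Pr(offspring is formed from gametes gam_a and gam_b, in either order)."""
    p = dad[gam_a] * mom[gam_b]
    if gam_a != gam_b:
        p += dad[gam_b] * mom[gam_a]
    return p


c = 0.24
coupling = gamete_freqs("PL", "pl", c)   # PL/pl parent
repulsion = gamete_freqs("Pl", "pL", c)  # Pl/pL parent

# Both parents in coupling: Pr(PPLL) = (1-c)^2 / 4
pr_ppll_cc = pr_offspring("PL", "PL", coupling, coupling)

# Dad in repulsion, mom in coupling: Pr(PPLL) = (c/2) * (1-c)/2
pr_ppll_rc = pr_offspring("PL", "PL", repulsion, coupling)

# Double heterozygote from two coupling parents: (1-c)^2/2 + c^2/2
pr_double_het = (pr_offspring("PL", "pl", coupling, coupling)
                 + pr_offspring("Pl", "pL", coupling, coupling))

print(pr_ppll_cc, pr_ppll_rc, pr_double_het)
```

The same two helpers apply to any pair of phases, since only the gamete tables change between crosses.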
In class problems
• Suppose c = 0.2
  – In a cross of AB/ab X AB/ab, what is freq(AABB)?
  – Suppose we cross AB/ab X Ab/aB. What is freq(AABB)?
• Now suppose c is unknown, but in a cross of AB/ab x AB/ab, freq(AABB) = 0.25.
  – What is c?
16
17
Linkage Disequilibrium
• Under linkage equilibrium, the frequency of gametes is the product of allele frequencies
  – e.g. Freq(AB) = Freq(A)*Freq(B)
  – A and B are independent of each other
• If the linkage phase of parents in some set or population departs from random (alleles not independent), linkage disequilibrium (LD) is said to occur
• The amount D_AB of disequilibrium for the AB gamete is given by
  – D_AB = Freq(AB gamete) - Freq(A)*Freq(B)
  – D > 0 implies AB gamete more frequent than expected
  – D < 0 implies AB less frequent than expected
18
Dynamics of D
• Under random mating in a large population, allele frequencies do not change. However, gamete frequencies do if there is any LD
• The amount of LD decays by a factor of (1-c) each generation
  – D(t) = (1-c)^t D(0)
• The expected frequency of a gamete (say AB) is
  – Freq(AB) = Freq(A)*Freq(B) + D
  – Freq(AB in gen t) = Freq(A)*Freq(B) + (1-c)^t D(0)
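As a worked illustration of the decay formula, the sketch below computes D(t) and the expected AB gamete frequency over generations. The function names and the starting values (allele frequencies 0.5, D(0) = 0.25, unlinked loci with c = 0.5) are illustrative assumptions, not values from the notes.

```python
def ld_after(d0, c, t):
    """Disequilibrium after t generations of random mating: (1-c)^t * D(0)."""
    return (1 - c) ** t * d0


def gamete_freq(p_a, p_b, d0, c, t):
    """Expected AB gamete frequency in generation t."""
    return p_a * p_b + ld_after(d0, c, t)


# Unlinked loci (c = 0.5): D halves every generation
for t in range(4):
    print(t, ld_after(0.25, 0.5, t), gamete_freq(0.5, 0.5, 0.25, 0.5, t))
```

With tight linkage the decay is much slower: at c = 0.01, D retains 0.99 of its value each generation.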
19
Parent phase    Gametes in excess (due to linkage)
AB/ab           Parental gametes AB, ab
Ab/aB           Parental gametes Ab, aB
AB/ab           Parental gametes AB, ab
Ab/aB           Parental gametes Ab, aB
Pool all gametes: AB, ab, Ab, aB equally frequent
No LD: random distribution of linkage phases
20
Parent phase    Gametes in excess (due to linkage)
AB/ab           Parental gametes AB, ab
AB/ab           Parental gametes AB, ab
AB/ab           Parental gametes AB, ab
Ab/aB           Parental gametes Ab, aB
Pool all gametes: Excess of AB, ab due to an excess of AB/ab parents
With LD, nonrandom distribution of linkage phase
21
Molecular Markers
SNP -- single nucleotide polymorphism. A particular position on the DNA (say base 123,321 on chromosome 1) that has two different nucleotides (say G or A) segregating
STR -- short tandem repeat. An STR locus consists of a number of short repeats, with alleles defined by the number of repeats. For example, you might have 6 and 4 copies of the repeat on your two chromosome 7s
In the molecular era, genetic maps are based not on alleles with large phenotypic effects (i.e., green vs. yellow peas), but rather on molecular markers
Even with whole-genome sequencing, sites are still classified into these two classes (plus other types)
22
SNPs vs STRs
SNPs
Cons: Less polymorphic (at most 2 alleles)
Pros: Low mutation rates, alleles very stable
      Excellent for looking at historical long-term associations (association mapping)
      Cheap to score 100,000s (+) on a single SNP chip
STRs
Cons: High mutation rate
Pros: Very highly polymorphic (more information/site)
      Excellent for linkage studies within an extended pedigree (QTL mapping in families or pedigrees)
23
Genetic maps
• Published genetic maps give the distances between molecular markers along a chromosome in terms of map units (m, expected number of crossovers between them), rather than their recombination frequencies c. Why?
  – c is not additive over loci, while m is
  – Hence, m is a more natural metric
  – Transition from an observed c to an estimated m requires a mapping function, which requires an assumption about how interference works
24
Genetic Maps and Mapping Functions
The unit of genetic distance between two markers is the recombination frequency, c
If the phase of a parent is AB/ab, then 1-c is the frequency of “parental” gametes (e.g., AB and ab), while c is the frequency of “nonparental” gametes (e.g., Ab and aB).
A parental gamete results from an EVEN number of crossovers, e.g., 0, 2, 4, etc.
For a nonparental (also called a recombinant) gamete, need an ODD number of crossovers between A & B, e.g., 1, 3, 5, etc.
25
Hence, simply using the frequency of “recombinant” (i.e., nonparental) gametes UNDERESTIMATES the expected number m of crossovers, as m > c
Mapping functions attempt to estimate the expected number of crossovers m from observed recombination frequencies c
When considering two linked loci, the phenomenon of interference must be taken into account
The presence of a crossover in one interval typically decreases the likelihood of a nearby crossover
In particular, c = Prob(odd number of crossovers)
26
Suppose the order of the genes is A-B-C.
If there is no interference (i.e., crossovers occur independently of each other) then
c_AC = Probability(odd number of crossovers btw A and C)
     = Pr(odd number in A-B, even number in B-C) + Pr(even number in A-B, odd number in B-C)
c_AC = c_AB (1 - c_BC) + (1 - c_AB) c_BC = c_AB + c_BC – 2 c_AB c_BC
27
We need to assume independence of crossovers in order to multiply these two probabilities
When interference is present, we can write this as
c_AC = c_AB + c_BC – 2(1-δ) c_AB c_BC
δ = interference parameter
δ = 1 --> complete interference: The presence of a crossover eliminates nearby crossovers
δ = 0 --> No interference. Crossovers occur independently of each other
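The three-locus formula is easy to evaluate for the two limiting cases. This is a minimal sketch with a hypothetical function name; the recombination values of 0.1 are illustrative.

```python
def recomb_ac(c_ab, c_bc, delta=0.0):
    """Recombination frequency between A and C for gene order A-B-C.

    delta = 0: no interference; delta = 1: complete interference.
    """
    return c_ab + c_bc - 2 * (1 - delta) * c_ab * c_bc


no_interference = recomb_ac(0.1, 0.1)               # 0.1 + 0.1 - 2(0.01)
complete_interference = recomb_ac(0.1, 0.1, 1.0)    # recombination fractions add
print(no_interference, complete_interference)
```

Only under complete interference are recombination frequencies additive, which is why maps are reported in m rather than c.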
28
Mapping functions. Moving from c to m
Haldane’s mapping function (gives Haldane map distances)
Assume the number k of crossovers in a region follows a Poisson distribution with parameter m
This makes the assumption of NO INTERFERENCE
Pr(Poisson = k) = λ^k exp(-λ)/k!, where λ = expected number of successes
$c = \sum_{k=0}^{\infty} p(m; 2k+1) = e^{-m} \sum_{k=0}^{\infty} \frac{m^{2k+1}}{(2k+1)!} = \frac{1 - e^{-2m}}{2}$
29
c = Prob(odd number of crossovers), so the sum for c runs over the odd crossover counts 2k + 1
Solving c = (1 - e^{-2m})/2 for m gives
$m = -\frac{\ln(1 - 2c)}{2}$
Relates recombination fraction c to expected number of crossovers m
Usually reported in units of Morgans or centiMorgans (cM)
One Morgan --> m = 1.0. One cM --> m = 0.01
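Haldane's function and its inverse can be written directly from these two formulas. The sketch below uses hypothetical function names and assumes no interference, as Haldane's model does; it also checks numerically that m is additive across adjacent intervals while c is not.

```python
import math


def haldane_c(m):
    """Recombination fraction c for map distance m (in Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * m))


def haldane_m(c):
    """Map distance m (in Morgans) for recombination fraction c, 0 <= c < 0.5."""
    if not 0.0 <= c < 0.5:
        raise ValueError("c must satisfy 0 <= c < 0.5")
    return -0.5 * math.log(1.0 - 2.0 * c)


# The Bateson-Punnett estimate c = 0.24 expressed as a Haldane map distance
m_hat = haldane_m(0.24)
print(f"c = 0.24 corresponds to {100 * m_hat:.1f} cM")

# Additivity: for order A-B-C with no interference, m_AC = m_AB + m_BC
c_ab, c_bc = 0.10, 0.15
c_ac = c_ab + c_bc - 2 * c_ab * c_bc
additive_gap = haldane_m(c_ac) - (haldane_m(c_ab) + haldane_m(c_bc))
```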
30
Organelle genetics
• With autosomal loci, each parent contributes an equal number of chromosomes
• However, the mitochondrial and chloroplast genomes are only passed from the mother.
  – While these have a small number of genes (mtDNA ~ 20, cpDNA ~ 50-100), they can still have phenotypic effects
  – Example: cytoplasmic sterility factors on mtDNA used in maize to avoid having to detassel pollen plants
31
Systems of matings and types of crosses
• Types of crosses
  – F1, F2, Backcrosses
  – Fk, Advanced intercross (AIC) lines
  – Isogenic/inbred lines
    • Recombinant inbred lines (RILs)
  – Selfing
    • Sk lines
  – Doubled haploids
32
Backcross design
P1 x P2 --> F1
B1 = F1 x P1
B2 = F1 x P2

Fraction genetic contributions from each parent
Cross    % P1             % P2
F1       50               50
B1       75               25
B2       25               75
B1(k)    1-(1/2)^(k+1)    (1/2)^(k+1)
B2(k)    (1/2)^(k+1)      1-(1/2)^(k+1)

B1(2) = B1 X P1. Repeated backcrossing to P1 gives B1(k) = B1(k-1) X P1 lines
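The backcross fractions follow from halving the donor contribution with each cross to the recurrent parent. A minimal sketch, with a hypothetical function name and exact fractions to avoid rounding:

```python
from fractions import Fraction


def recurrent_parent_fraction(k):
    """Expected fraction of the genome from the recurrent parent in B(k).

    k = 1 is the first backcross (F1 x recurrent parent).
    """
    return 1 - Fraction(1, 2) ** (k + 1)


for k in range(1, 6):
    frac = recurrent_parent_fraction(k)
    print(f"B(k={k}): recurrent {float(frac):.4f}, donor {float(1 - frac):.4f}")
```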
33
F2-based crosses
• Randomly mating the F1 generates the F2
  – These can also be generated by selfing each F1
• Isogenic (or inbred) lines are created by taking a set of F1 individuals and selfing each for 5-10 generations to create a series of inbred lines
  – Generates a series of pure lines that capture some of the initially segregating variation
  – When generated following a cross, also called RILs
• Advanced intercross lines (AIC) are created by randomly mating the F2 line for multiple generations (AIC(k) = Fk = k generations of random mating)
  – Has the effect of expanding the genetic map: in the AIC(k), the recombination rate between two markers is ~ c*k
34
Selfing, Doubled Haploids
• If two inbred lines are crossed, all of the F1 are heterozygotes. If we self the F1 for k generations, then the fraction of loci that are heterozygotes is (1/2)^k.
  – Less than 1% in the F7, 0.09% in F10
  – Sk lines refer to k generations of selfing an F2, e.g., with only selfing Sk = F(k+2)
  – S0 = the F2 from selfing an F1 line
• Doubled haploids, DH, (the doubling of a haploid set in a gamete) produces fully inbred individuals in one generation
  – DH lines capture most of the initial LD (only a single generation of recombination)
  – Selfed-generated lines further decay some LD, but not as efficiently as random mating.
35
Selfing and Favorable Alleles
• Suppose inbred lines 1 and 2 are each fixed for five favorable alleles not found in the other.
  – In the F1, all individuals carry at least one favorable allele at each locus
  – In the F2, the probability a locus contains at least one favorable allele is Pr(favorable homozygote) + Pr(heterozygote) = (1/4) + (1/2) = 3/4.
    • Pr(all 10 loci do) = (3/4)^10 = 0.056
  – If fully inbred, Pr(at least one favorable allele at a locus) = 1/2
    • Pr(all 10 loci fixed for favorable allele) = (1/2)^10 = 1/1024
    • Roughly 57 times less likely than in an F2.
    • Hence, while inbred lines build up loci fixed for both favorable alleles, they have fewer loci with favorable alleles than an F1 or F2.
36
Effects of selection
• Now suppose that selection increases the frequency of each favorable allele from 0.5 to 0.9
  – For an inbred, now Pr(all loci fixed for favorable alleles) = 0.9^10 = 0.35
  – For a random-mating population, Pr(all loci contain a favorable allele) = (1 - 0.1^2)^10 = 0.904
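These probabilities can be reproduced for any favorable-allele frequency p and number of loci. The sketch below uses hypothetical function names and assumes independent loci with the same p at each, which is what multiplying across the 10 loci implies.

```python
def pr_all_fixed_inbred(p, n_loci):
    """Fully inbred line: each locus is fixed for the favorable allele w.p. p."""
    return p ** n_loci


def pr_all_carry_random_mating(p, n_loci):
    """Random mating: a locus lacks the favorable allele w.p. (1-p)^2."""
    return (1 - (1 - p) ** 2) ** n_loci


# Before selection (p = 0.5): the F2 versus a fully inbred line
f2 = pr_all_carry_random_mating(0.5, 10)
inbred = pr_all_fixed_inbred(0.5, 10)
print(f"F2: {f2:.3f}, inbred: {inbred:.6f}, ratio: {f2 / inbred:.1f}")

# After selection (p = 0.9)
print(f"inbred: {pr_all_fixed_inbred(0.9, 10):.2f}, "
      f"random mating: {pr_all_carry_random_mating(0.9, 10):.3f}")
```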
37
Types of Gene Action
• At a single gene, we can see dominance
  – The heterozygote has a phenotype that is different from the average of the two homozygotes
  – Interaction between the two alleles at a locus
• We can also have pleiotropy, where a single gene influences two or more traits.
• When two (or more) genes influence the same trait, the possibility of epistasis exists
  – The two-locus phenotype is not simply the sum of the two single-locus phenotypes
38
Epistasis
• Consider the two-locus genotype AiAjBkBl
• Let Gij.. = Gij denote the average deviation between an AiAj individual and the population mean, same for Gkl
• If Gijkl = u + Gij + Gkl, i.e., the two-locus genotypic value is the sum of each single-locus genotypic value (based on deviations from the mean u), then we say genotypic values are additive across loci (while dominance might still occur at either locus)
  – If this is NOT the case, we say that epistasis occurs --- the two-locus genotype departs from the average contribution of both single loci.
  – Dominance = interaction between alleles at the SAME locus
  – Epistasis = interaction between alleles at DIFFERENT loci
39
Example
       AA    Aa    aa
BB     10    15    20
Bb     10    15    20
bb      5    10    15

B is dominant to b, A is additive (no dominance). However, no epistasis, as phenotypic value is (B phenotype) + 5*(# of a alleles), namely the sum of the two genotypic values at each locus
Single-locus values: BB = 10, Bb = 10, bb = 5; AA = 0, Aa = 5, aa = 10
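One way to confirm the absence of epistasis in a two-locus table is to check that every interaction contrast is zero, i.e., that the difference between any two B genotypes is the same in every A column. The sketch below is an illustration under that definition; the function name and the modified table are hypothetical.

```python
def has_epistasis(table):
    """Return True if any cell departs from row effect + column effect.

    table maps a B-locus genotype to its values across A-locus genotypes.
    A cell is additive if G[i][j] - G[i][0] - G[0][j] + G[0][0] == 0.
    """
    rows = list(table.values())
    ref = rows[0]
    for row in rows:
        for j in range(len(row)):
            if row[j] - row[0] - ref[j] + ref[0] != 0:
                return True
    return False


# Genotypic values from the example (columns: AA, Aa, aa)
example = {"BB": [10, 15, 20], "Bb": [10, 15, 20], "bb": [5, 10, 15]}

# Hypothetical variant in which the bb/aa cell no longer fits the additive pattern
variant = {"BB": [10, 15, 20], "Bb": [10, 15, 20], "bb": [5, 10, 25]}

print(has_epistasis(example), has_epistasis(variant))
```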
Lecture 2b: The Genetic Determinants of Size in Plants, Animals, and Humans
Mike Gore lecture notes
Tucson Plant Breeding Institute
Module 1
1
• Pea and corn
• Dog
• Human
2
Height adaptations are essential to plant fitness and agricultural performance
• Natural populations: natural selection for plant height to improve light interception, carbon and nutrient capture, weed competition, and seed dispersal
• Breeding populations: artificial selection for plant height to increase harvest uniformity, favorably partition carbon and nutrients, and enhance input use efficiency
Peiffer et al. 2014. Genetics 196:1337-1356
3
Father of the Green Revolution: Borlaug developed high-yielding, short-stature wheat at the International Maize and Wheat Improvement Center (CIMMYT, MX) in the 1960s
Starts work in India and Pakistan
The fertilizer-responsive wheat varieties he developed, grown in Latin America and Asia, saved millions, if not a billion, people from starvation. Received 1970 Nobel Peace Prize.
http://rationalwiki.org/wiki/Norman_Borlaug
4
Mendel’s Peas 1866
All of Mendel’s seven traits in pea were controlled by single genes that showed independent assortment
http://www.ck12.org/book/CK-12-Biology-Concepts/r11/section/3.1/
5
Le (stem length) gene controls internode elongation
Le/Le: tall
le/le: dwarf
GA20 --> GA1 (bioactive), catalyzed by gibberellin 3β-hydroxylase
Reduced activity of le is from an alanine-to-threonine substitution in the active site of the enzyme
Lester et al. 1997. Plant Cell 9:1435-1443
6
Since Mendel’s experiments, plant height has continued to be studied because of its high heritability and ease of measurement in plant populations
nps.gov/history/history/online_books/science/8/chap5.htm
1974 – Steel measuring tape
2014 – Barcoded measuring tape
7
More than 40 genes at which mutations have large effects on height have been identified in maize
These 40 genes are mostly involved in hormone (e.g., auxin, gibberellin and brassinosteroid) synthesis, transport, and signaling
Salas Fernandez et al. 2009. Trends Plant Sci. 14:454-461
Peiffer et al. 2014. Genetics 196:1337-1356 8
Evolutionary models predict that loss of unfavorable large-effect alleles is likely as a population approaches optimal fitness/productivity in agricultural systems
Peiffer et al. 2014. Genetics 196:1337-1356
• Maize brachytic2 (br2) mutants have compact lower stalk internodes
• The height reduction results from the loss of a P-glycoprotein that modulates polar auxin transport in the maize stalk
Multani et al. 2003. Science 302:81-84
9
In total, 4892 NAM and IBM RILs were scored for PHT in >7 environments
Joint-linkage mapping of QTL for plant height (PHT) in the maize Nested Association Mapping (NAM) panel
Peiffer et al. 2014. Genetics 196:1337-1356 10
Wide range in phenotypic variation for PHT (cm) across families, with transgressive segregation in all families
High heritability (H2) across and within NAM families
Peiffer et al. 2014. Genetics 196:1337-1356
11
The joint-linkage QTL models identified 35 QTL that explained ~76% of PHT variation
Largest effect locus: 2.1 ± 0.9% of PHT variation
The QTL effect estimates (4-6 cm) were validated by fine mapping in two near-isogenic line families
The intervals contained >100 genes, with no obvious candidates
Peiffer et al. 2014. Genetics 196:1337-1356
12
JL-assisted GWAS with ~30 million SNPs: Resolving the identified QTL associated with height in maize
Peiffer et al. 2014. Genetics 196:1337-1356
The molecular basis of natural PHT variation in maize remains largely elusive
Variation in PHT is well explained by Fisher’s infinitesimal model of genetic architecture
3 out of > 120 candidates for height loci
13
• Pea and corn
• Dog
• Human
14
Domestic dog breeds exhibit tremendous diversity in body size
http://hondentrimsalonscissors.nl/
Domestic dog originated from the gray wolf 15,000 years ago, but most breeds are only a few hundred years old
15
Within a dog breed: Fine mapping of a major QTL for body size on chr. 15 in Portuguese water dog (PWD)
Sutter et al. 2007. Science 316: 112–115
n = 463
Insulin-like growth factor 1 (IGF1) is known to influence body size in mice and humans
http://www.dogbreedinfo.com/portuguesewaterdog.htm
16
Sutter et al. 2007. Science 316: 112–115
In PWD, 15% of the phenotypic variance in skeletal size is explained by the IGF1 haplotype
In PWD, 96% of chromosomes carry only one of two haplotypes –B haplotype confirmed smaller skeletal size
http://www.dogbreedinfo.com/portuguesewaterdog.htm
17
Association mapping of body weight in 14 small and 9 giant dog breeds identified IGF1
(Figure: Mann-Whitney U (MWU) P-values for 116 SNPs, with IGF1 marked, and for 83 SNPs on a control chr)
Sutter et al. 2007. Science 316: 112–115
18
The B haplotype at IGF1 is most associated with small body size
Yellow – ancestral allele (golden jackal)
Blue – derived allele
Small n = 14, Large n = 9
Strong linkage disequilibrium among variants prevented the identification of a causative variant
Sutter et al. 2007. Science 316: 112–115
19
The IGF1 small dog haplotype is derived from Middle Eastern gray wolves
http://www.redorbit.com/education/reference_library/animal_kingdom/mammalia/2580178/southerneast_asian_wolf/
A few major QTL appear to control body size in dogs – result of selection for novelty and bottleneck
20
• Pea and corn
• Dog
• Human
21
Human height is a classic polygenic trait with more than 80% of the variation within a given population estimated to be attributable to additive genetic factors
“Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” - Sir Francis Galton, Natural Inheritance, 1889, describing what is now known as the central limit theorem
http://terrytao.wordpress.com/2010/09/14/a-second-draft-of-a-non-technical-article-on-universality/
n = 143
~165 cm (5.41 ft) for females ~178 cm (5.84 ft) for males
Visscher 2008. Nat. Genet. 40:489-490 22
Mutations that cause extreme stature are rare and highly unlikely to explain natural variation in human height
http://www.ibtimes.co.uk/worlds-tallest-man-meets-shortest-guinness-world-records-day-2014-makes-wacky-additions-almanac-1474652
251 cm (8.23 ft)
54.6 cm (1.79 ft)
23
A GWAS of height in a population of 183,727 individuals identified 180 SNP loci with small additive effects (stage 1)
Lango Allen et al. 2010. Nature 467:832 - 838
Enrichment of signals at genes in biological pathways and that underlie skeletal growth defects
Proportion of variance explained by 180 SNP loci in separate pops (stage 2)
Population     Proportion
FINGERSTURE    0.08 ± 0.02
RS2            0.11 ± 0.01
RS3            0.11 ± 0.01
GOOD           0.09 ± 0.02
QIMR           0.11 ± 0.02
24
Yang et al. 2010. Nat. Genet. 42: 565–569
A genome-wide prediction of height in a population of 3,935 individuals with 294,831 common SNPs that together explain 45-54% of phenotypic variance
The remaining heritability is unexplained, likely because of incomplete LD between the SNPs and causal variants with small effects or low MAF
(adjustment for prediction error- incomplete LD between SNPs and causal variants)
25
A GWAS meta-analysis of 78 height studies on 253,288 individuals identified sets of common SNPs (0.1-1% of total) that explained ~16-29% of phenotypic variance
Genetic architecture for height is defined by a very large finite number (thousands) of causal variants
Wood et al. 2014. Nat. Genet. 46:1173-1186
Validation in five independent populations excluded from meta‐analysis
All common SNPs explain even more!
26
Wood et al. 2014. Nat. Genet. 46:1173-1186
Genes at 423 loci tended to be highly expressed in tissues related to cartilage, joints and spine and other musculoskeletal, cardiovascular and endocrine tissues
Tissue enrichment combined with pruned gene set network 37,427 human microarray samples
Growth genes
27
1
Lecture 3: Basic Probability and Statistical Tools
Bruce Walsh Notes
Introduction to Plant Quantitative Genetics
Tucson. 8-10 Jan 2018
I: Probability
2
3
Basic probability
• Events are possible outcomes from some random process
  – e.g., a genotype is AA, a phenotype is larger than 20
• Pr(E) denotes the probability of an event E
• Pr(E) is between zero and one
• The sum of the probabilities of all possible nonoverlapping events is one.
  – e.g., if the possible events are E1, … , Ek, then
  – Pr(E1) + … + Pr(Ek) = 1
4
The AND rule
• Consider two possible events, E1 and E2.
• If these are independent (knowledge that one has occurred does not change the probability of the second), then the joint probability Pr(E1,E2), the probability of E1 AND E2, is Pr(E1,E2) = Pr(E1)*Pr(E2)
• Hence, with independence, AND = multiply
• Conditional probability is used when the events are NOT independent
5
Example
• Consider the cross AaBbCc X aaBbCc
  – What is the probability of an aabbcc offspring?
  – Assuming independent assortment (no linkage)
  – = Pr(aa | Aa x aa) * Pr(bb | Bb x Bb) * Pr(cc | Cc x Cc) = (1/2)(1/4)(1/4) = 1/32
• How many offspring do we need to score to have a 90% probability of seeing at least one?
  – Let p = 1/32. Prob(not seeing aabbcc in n offspring) = (1-p)^n.
  – Prob(at least one) = 0.9 implies Prob(none) = 0.1
  – (1-p)^n = 0.1, or n = log(0.1)/log(1-1/32) = 72.5, so at least 73 offspring must be scored
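The sample-size calculation generalizes to any genotype probability and target certainty. A minimal sketch, with a hypothetical function name, that rounds up to a whole number of offspring:

```python
import math


def offspring_needed(p, prob_at_least_one=0.9):
    """Smallest n with 1 - (1 - p)^n >= prob_at_least_one."""
    return math.ceil(math.log(1 - prob_at_least_one) / math.log(1 - p))


p_aabbcc = (1 / 2) * (1 / 4) * (1 / 4)   # AND rule across the three loci
n_needed = offspring_needed(p_aabbcc)
print(f"p = {p_aabbcc}, score at least {n_needed} offspring")
```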
6
The OR rule
• Again, consider two possible events, E1 and E2.
• If these events are NONOVERLAPPING (they contain no common elements), then Pr(E1 or E2) = Pr(E1) + Pr(E2)
• Hence, OR = add
• Example:
  – What is the probability that a genotype is A-, i.e., that it is AA or Aa?
  – The events genotype = AA and genotype = Aa are nonoverlapping
  – Hence, Pr(A-) = Pr(AA or Aa) = Pr(AA) + Pr(Aa)
7
Conditional Probability
• It is ALWAYS true that
– Pr(A,B) = Pr(A|B)Pr(B) = Pr(B|A)Pr(A)
– Pr(A|B) is the conditional probability of A given B
– Pr(A) is the marginal probability of A
– Pr(A,B) is the joint probability of A and B
– If Pr(A|B) = Pr(A) for all possible B values, then A and B are independent
• Note that
– Pr(A|B) = Pr(A,B)/Pr(B)
8
Examples of probability (cont.)
• Recall that yellow peas (Y-) are dominant to green peas (gg). Consider the F2 in a cross of YY x gg.
– What is the probability of a yellow F2 offspring?
• Pr(yellow) = Pr(YY or Yg) = Pr(YY) + Pr(Yg) = 1/4 + 1/2 = 3/4
– What is the probability that a yellow F2 offspring is a YY homozygote?
• Pr(YY | F2 yellow) = Pr(YY and F2 yellow)/Pr(F2 yellow) = (1/4)/(3/4) = 1/3.
9
Bayes’ Theorem
Suppose an unobservable random variable (RV) takes on values b1, …, bn
Suppose that we observe the outcome A of an RV correlated with b. What can we say about b given A?
Bayes’ theorem:
$$\Pr(b_j \mid A) = \frac{\Pr(A \mid b_j)\,\Pr(b_j)}{\Pr(A)} = \frac{\Pr(A \mid b_j)\,\Pr(b_j)}{\sum_{i=1}^{n} \Pr(A \mid b_i)\,\Pr(b_i)}$$
A typical application in genetics is that A is some phenotype and b indexes some underlying (but unknown) genotype
Example
• You have an F2 plant that gives yellow peas from a pure yellow (YY) x pure green (gg) cross (green recessive)
– Hence, your plant is Y-, but could be YY or Yg.
– You test-cross this plant by crossing to a gg parent.
– If the plant is YY, expect only yellow offspring. If Yg, expect 50:50 yellow/green
• If you score 5 offspring and all are yellow, what is the probability this plant is YY?
10
• You want to compute
– Pr(F2 yellow is YY | 5 yellow offspring)
• First, you need your prior
– Prob(F2 yellow is YY) = 1/3
– Prob(F2 yellow is Yg) = 2/3
• Second,
– Prob(5 yellow offspring | YY) = 1
– Prob(5 yellow offspring | Yg) = $(1/2)^5$
• Note that Pr(5 yellow offspring) = Prob(5 yellow offspring | YY) * Pr(YY) + Prob(5 yellow offspring | Yg) * Pr(Yg) = $1 \cdot (1/3) + (1/2)^5 \cdot (2/3) = 0.3542$
• From Bayes
– Pr(F2 yellow is YY | 5 yellow offspring)
• = Prob(5 yellow offspring | YY) * Pr(YY) / Pr(5 yellow offspring)
• = 1*(1/3)/0.3542 = 0.941
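The testcross calculation is an instance of Bayes' theorem over a discrete set of hypotheses. A minimal sketch follows; the helper `posterior` and the variable names are illustrative assumptions, not part of the notes.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem over a discrete set of hypotheses (here, genotypes)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())  # Pr(data), the normalizing constant
    return {h: joint[h] / total for h in joint}

n_yellow = 5                                       # all testcross offspring yellow
priors = {"YY": 1 / 3, "Yg": 2 / 3}                # prior for a yellow F2 plant
likelihoods = {"YY": 1.0, "Yg": 0.5 ** n_yellow}   # Pr(5 yellow | genotype)
post = posterior(priors, likelihoods)
print(round(post["YY"], 3))
```

Changing `n_yellow` shows how quickly additional all-yellow offspring push the posterior for YY toward one.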
11
12
Second example: Suppose height > 70. What is the probability the individual is QQ? Qq? qq?
Suppose:
Genotype                     QQ    Qq    qq
Freq(genotype)               0.5   0.3   0.2
Pr(height > 70 | genotype)   0.3   0.6   0.9
Pr(height > 70) = 0.3*0.5 + 0.6*0.3 + 0.9*0.2 = 0.51
Pr(QQ | height > 70) = Pr(QQ) * Pr(height > 70 | QQ) / Pr(height > 70) = 0.5*0.3 / 0.51 = 0.294
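The slide works out only the QQ case. Applying the same formula to the other two genotypes (a worked completion using the frequencies and conditional probabilities given in this example) gives:

```latex
\begin{aligned}
\Pr(Qq \mid \text{height} > 70) &= \frac{0.3 \times 0.6}{0.51} = \frac{0.18}{0.51} \approx 0.353 \\
\Pr(qq \mid \text{height} > 70) &= \frac{0.2 \times 0.9}{0.51} = \frac{0.18}{0.51} \approx 0.353
\end{aligned}
```

Together with Pr(QQ | height > 70) = 0.294, the three posterior probabilities sum to one, as they must.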
2. Probability distributions and random variables
13
14
Discrete Random Variables
A random variable (RV) = outcome (realization) not a set value, but rather drawn from some probability distribution
A discrete RV x takes on values X1, X2, …, Xk
Probability distribution: Pi = Pr(x = Xi)
Pi ≥ 0, Σ Pi = 1
Probabilities are non-negative and sum to one
Example: Suppose the probability of seeing no individuals of genotype AABB in our sample is 0.1. What is the probability of seeing at least one? Pr(none) + Pr(at least one) = 1, hence Pr(at least one) = 1 - Pr(none) = 0.9
15
The Binomial Distribution
• What is the probability of k successes in a series of n trials where the probability p of success is the same for each trial?
• This is given by the binomial distribution,
– $\Pr(k \text{ successes} \mid n, p) = \frac{n!}{(n-k)!\,k!}\, p^k (1-p)^{n-k}$
• Example: Suppose p = 0.05 and n = 10. What is the probability of seeing EXACTLY one success?
– $\Pr(k=1) = \frac{10!}{9!\,1!}\, 0.05^1\, 0.95^9 = 10 \cdot 0.05^1 \cdot 0.95^9 = 0.315$
• What is the probability of seeing AT LEAST one success?
– $\Pr(k > 0) = 1 - \Pr(k=0) = 1 - (1-0.05)^{10} = 0.401$
16
The Poisson Distribution
• Given that the expected number of successes in our sample is λ, what is the probability that we see k successes?
• This is given by the Poisson distribution
– $\Pr(k \text{ successes} \mid \lambda) = e^{-\lambda} \lambda^k / k!$
• Example: suppose λ = 0.5.
– $\Pr(k = 1) = e^{-0.5}\, 0.5^1 / 1! = 0.303$
– $\Pr(\text{at least one success}) = 1 - \Pr(k = 0) = 1 - e^{-0.5} = 0.393$
• Connection with binomial: λ = n*p
– Can use the Poisson either as an approximation to the binomial or when the sample size n is not given
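To see how close the Poisson approximation is for the running example (n = 10, p = 0.05, so λ = 0.5), the two probability functions can be compared directly. This is a sketch using only the standard library; the function names are illustrative.

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Pr(k successes in n trials), each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Pr(k successes) when the expected number of successes is lam."""
    return exp(-lam) * lam**k / factorial(k)

n, p = 10, 0.05
lam = n * p  # connection between the two distributions
for k in range(4):
    print(k, round(binom_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))
```

The approximation improves as n grows and p shrinks with n*p held fixed.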
17
The geometric distribution
• Given success probability p per trial, how many failures k occur before the first success?
• This is a waiting-time (as opposed to a counting) problem, and is given by the geometric distribution
– $\Pr(k \text{ failures before a success}) = (1-p)^k p$
– Example: Suppose p = 0.05. What is the probability of AT LEAST one success in the first 10 trials?
– $= 1 - \Pr(\text{none in first 10}) = 1 - (1-p)^{10} = 0.401$
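The "at least one success in the first 10 trials" result can also be obtained by summing geometric probabilities over 0 to 9 failures before the first success. A short check (function name illustrative):

```python
def geom_pmf(k, p):
    """Pr(exactly k failures before the first success)."""
    return (1 - p) ** k * p

p = 0.05
# First success on trial 1..10 means 0..9 failures come before it.
summed = sum(geom_pmf(k, p) for k in range(10))
closed_form = 1 - (1 - p) ** 10
print(round(summed, 3), round(closed_form, 3))
```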
Continuous Random Variables
A continuous RV x can take on any possible value in some interval (or set of intervals). The probability distribution is defined by the probability density function (or pdf), p(x), with
$$p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\,dx = 1, \qquad \Pr(x_1 \le x \le x_2) = \int_{x_1}^{x_2} p(x)\,dx$$
Finally, the cdf, or cumulative probability function, is defined as cdf(z) = Pr(x ≤ z)
19
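Example: The normal (or Gaussian) distribution
Mean $\mu$, variance $\sigma^2$: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
Unit normal (mean 0, variance 1): $p(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$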
20
Mean ($\mu$) = peak of distribution
The variance is a measure of spread about the mean. The smaller $\sigma^2$, the narrower the distribution about the mean
3. Expectations and descriptive statistics
21
22
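Expectations of Random Variables
The expected value, E[f(x)], of some function f of the random variable x is just the average value of that function:
$E[f(x)] = \sum_i f(X_i)\,P_i$ for a discrete RV, and $E[f(x)] = \int f(x)\,p(x)\,dx$ for a continuous RV
E[x] = the (arithmetic) mean, $\mu$, of a random variable x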
23
Expectations of Random Variables
$E[(x - \mu)^2] = \sigma^2$, the variance of x
More generally, the rth moment about the mean is given by $E[(x - \mu)^r]$
r = 2: variance ($\sigma^2$)
r = 3: skew (value is zero for a normal)
r = 4: (scaled) kurtosis ($3\sigma^4$ for a normal)
Useful properties of expectations
24
Covariances
• Cov(x,y) = E[(x - µx)(y - µy)]
• = E[x*y] - E[x]*E[y]
Cov(x,y) > 0, positive (linear) association between x & y
[Figure: scatterplot of Y against X with cov(X,Y) > 0]
25
Cov(x,y) < 0, negative (linear) association between x & y
[Figure: scatterplot of Y against X with cov(X,Y) < 0]
Cov(x,y) = 0, no linear association between x & y
[Figure: scatterplot of Y against X with cov(X,Y) = 0]
26
Cov(x,y) = 0 DOES NOT imply no association
[Figure: scatterplot of Y against X with cov(X,Y) = 0]
If x and y are independent, then cov(x,y) = 0
However, cov(x,y) = 0 DOES NOT imply that x and y are independent.
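A small constructed example makes the last point concrete: y below is completely determined by x, yet their covariance is zero because the relationship is symmetric rather than linear. The data are invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def cov(x, y):
    """Population covariance, E[(x - mean_x)(y - mean_y)]."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])

x = [-2, -1, 0, 1, 2]
y = [v**2 for v in x]   # y is a deterministic function of x
print(cov(x, y))        # zero covariance despite complete dependence
```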
27
Correlation
Cov = 10 tells us nothing about the strength of an association
What is needed is an absolute measure of association
This is provided by the correlation, r(x,y):
$$r(x,y) = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$$
r = 1 implies a perfect (positive) linear association
r = -1 implies a perfect (negative) linear association
28
Useful Properties of Variances and Covariances
• Symmetry, Cov(x,y) = Cov(y,x)
• The covariance of a variable with itself is the variance, Cov(x,x) = Var(x)
• If a is a constant, then
– Cov(ax,y) = a Cov(x,y)
• $\mathrm{Var}(ax) = a^2\,\mathrm{Var}(x)$
– $\mathrm{Var}(ax) = \mathrm{Cov}(ax,ax) = a^2\,\mathrm{Cov}(x,x) = a^2\,\mathrm{Var}(x)$
• Cov(x+y,z) = Cov(x,z) + Cov(y,z)
29
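$$\mathrm{Var}(x+y) = \mathrm{Var}(x) + \mathrm{Var}(y) + 2\,\mathrm{Cov}(x,y)$$
Hence, the variance of a sum equals the sum of the variances ONLY when the elements are uncorrelated
More generally
$$\mathrm{Var}\!\left(\sum_{i} x_i\right) = \sum_{i}\sum_{j} \mathrm{Cov}(x_i, x_j)$$
Question: What is Var(x-y)?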
30
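Regressions
Consider the best (linear) predictor of y given we know x:
$$\hat{y} = \bar{y} + b_{y|x}\,(x - \bar{x})$$
The slope of this linear regression is a function of Cov:
$$b_{y|x} = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)}$$
The variation in y NOT accounted for by knowing x, i.e., the residual variance $\mathrm{Var}(\hat{y} - y)$, is $\mathrm{Var}(y)\,[1 - r^2]$; the fraction accounted for is $r^2$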
31
Relationship between the correlation and the regression slope:
$$r(x,y) = b_{y|x}\,\sqrt{\frac{\mathrm{Var}(x)}{\mathrm{Var}(y)}}$$
If Var(x) = Var(y), then $b_{y|x} = b_{x|y} = r(x,y)$
In this case, the fraction of variation accounted for by the regression is $b^2$
32
[Figure: four scatterplots with fitted regressions, illustrating $r^2$ = 0.3, 0.6, 0.9, and 1.0]
33
Properties of Least-squares Regressions
The slope and intercept obtained by least-squares minimize the sum of squared residuals:
$$\sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - a - b\,x_i)^2$$
• The average value of the residual is zero
• The LS solution maximizes the amount of variation in y that can be explained by a linear regression on x
• The residual errors around the least-squares regression are uncorrelated with the predictor variable x
• Fraction of variance in y accounted for by the regression is $r^2$
• Homoscedastic vs. heteroscedastic residual variances
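These properties can be verified numerically. The sketch below fits a least-squares line using slope = Cov(x,y)/Var(x) and checks that the residuals average zero, are uncorrelated with x, and that the explained fraction of variance equals r². The data points are invented for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def cov(x, y):
    """Population covariance; cov(x, x) is the variance of x."""
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope = cov(x, y) / cov(x, x)          # b = Cov(x, y) / Var(x)
intercept = mean(y) - slope * mean(x)  # line passes through the means
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

r2 = cov(x, y) ** 2 / (cov(x, x) * cov(y, y))
explained = 1 - cov(resid, resid) / cov(y, y)
print(round(slope, 3), round(mean(resid), 10), round(cov(x, resid), 10), round(r2, 4))
```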
4. Different methods of statistical analysis
34
35
Different methods of analysis
• Parameters of these various models can be estimated in a number of frameworks
• Method of moments
– Very few assumptions about the underlying distribution. Typically, the mean of some statistic has an expected value of the parameter
– Example: Estimate of the mean µ given by the sample mean, xbar, as E(xbar) = µ.
– While estimation does not require distribution assumptions, confidence intervals and hypothesis testing do
• Distribution-based estimation
– The explicit form of the distribution is used
36
Distribution-based estimation • Maximum likelihood estimation
– MLE – REML – More in Lynch & Walsh (book) Appendix 3
• Bayesian – More in Walsh & Lynch (online chapters = Vol 2)
Appendices 2,3
37
Maximum Likelihood
p(x1,…, xn | θ) = density of the observed data (x1,…, xn) given the (unknown) distribution parameter(s) θ
Fisher suggested the method of maximum likelihood: given the data (x1,…, xn), find the value(s) of θ that maximize p(x1,…, xn | θ)
We usually express p(x1,…, xn | θ) as a likelihood function l(θ | x1,…, xn) to remind us that it is dependent on the observed data
The Maximum Likelihood Estimator (MLE) of θ is the value (or values) that maximizes the likelihood function l given the observed data x1,…, xn.
38
[Figure: likelihood l(θ | x) plotted against θ, with the MLE of θ at the peak]
The curvature of the likelihood surface in the neighborhood of the MLE informs us as to the precision of the estimator. A narrow peak = high precision. A broad peak = low precision
This is formalized by looking at the log-likelihood surface, L = ln[l(θ | x)]. Since ln is a monotonic function, the value of θ that maximizes l also maximizes L
The larger the curvature, the smaller the variance
39
Likelihood Ratio tests
Hypothesis testing in the ML framework occurs through likelihood-ratio (LR) tests
$$LR = -2 \ln\!\left[\frac{l(\hat{\theta}_r \mid x)}{l(\hat{\theta} \mid x)}\right] = -2\left[L(\hat{\theta}_r \mid x) - L(\hat{\theta} \mid x)\right]$$
$\hat{\theta}_r$ is the MLE under the restricted conditions (some parameters specified, e.g., var = 1)
$\hat{\theta}$ is the MLE under the unrestricted conditions (no parameters specified)
For large sample sizes (generally) LR approaches a Chi-square distribution with r df (r = number of parameters assigned fixed values under the null)
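As an illustration of a likelihood-ratio test, the sketch below tests the null hypothesis that a normal mean is zero, taking the variance as known and equal to one (a simplifying assumption made here, not in the notes), with LR defined as minus twice the log of the ratio of restricted to unrestricted likelihoods. One parameter is fixed under the null, so LR is compared to a chi-square with 1 df, whose upper-tail probability equals erfc(sqrt(LR/2)). The data values are invented.

```python
import math

def loglik(mu, data):
    """Log-likelihood of normal data with mean mu and variance fixed at 1."""
    n = len(data)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in data)

data = [0.8, -0.3, 1.5, 0.9, 1.1, 0.2, 1.7, 0.4]
mle = sum(data) / len(data)   # unrestricted MLE of the mean is the sample mean

# LR = -2 [ L(restricted) - L(unrestricted) ], restricted value: mu = 0
lr = -2 * (loglik(0.0, data) - loglik(mle, data))

# Upper tail of a chi-square with 1 df
p_value = math.erfc(math.sqrt(lr / 2))
print(round(lr, 3), round(p_value, 4))
```

For this model LR reduces to n times the squared sample mean, which gives a quick sanity check on the code.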
40
Bayesian Statistics
An extension of likelihood is Bayesian statistics
Instead of simply estimating a point estimate (e.g., the MLE), the goal is to estimate the entire distribution for the unknown parameter θ given the data x
p(θ | x) = C * l(x | θ) p(θ)
p(θ | x) is the posterior distribution for θ given the data x
l(x | θ) is just the likelihood function
p(θ) is the prior distribution on θ
C is a normalizing constant, chosen so that the posterior integrates to one
41
Bayesian Statistics Why Bayesian?
• Exact for any sample size
• Marginal posteriors
• Efficient use of any prior information
• MCMC (such as Gibbs sampling) methods
Priors quantify the strength of any prior information. Often these are taken to be diffuse (with a high variance), so prior weights on θ spread over a wide range of possible values.
42
p values in Hypothesis testing
• The p value of a test statistic is the probability of seeing a value as extreme (or more extreme) under the null hypothesis
• For example, suppose you are assuming a random variable comes from a normal with mean zero and variance one.
– The probability of seeing a value more extreme than 2 (i.e., greater than 2 or less than -2) is 0.0455, the p value associated with this value of the test statistic.
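The two-sided p value for a unit normal can be computed from the complementary error function in the standard library, since Pr(|z| > c) = erfc(c/√2). A minimal sketch (function name illustrative):

```python
import math

def two_sided_p(z):
    """Pr(|Z| > |z|) for a unit normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

print(round(two_sided_p(2.0), 4))
```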
43
Significance and multiple comparisons
• One could either report a p value or have some criterion (e.g., any test with a p value less than 0.01) that declares a test to be significant (and hence a positive result)
– p is the probability of a false positive, the probability of declaring a test under the null as being significant.
• The problem of multiple comparisons arises when a large number of tests are performed.
– Suppose our significance threshold is p = 0.005, but 1000 tests are done. Under the null, we still expect 0.005*1000 = 5 significant tests
– Bonferroni corrections are done by first setting a significance level for the entire COLLECTION of tests (say π = 0.05). To have this level of experiment-wide control of false positives requires that each test uses p = π/n
• For n = 1000, an experiment-wide false positive rate (probability) of 0.05 declares significance only when the p value for a test is less than 0.05/1000 = 0.00005.
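The numbers in this example can be reproduced as follows. The second function additionally assumes that the tests are independent and that every null is true, which the notes do not state; it is included only to show that the Bonferroni threshold keeps the experiment-wide rate just under the target.

```python
def bonferroni_threshold(alpha_experiment, n_tests):
    """Per-test significance threshold for experiment-wide level alpha."""
    return alpha_experiment / n_tests

def prob_any_false_positive(p_per_test, n_tests):
    """Pr(at least one false positive), ASSUMING independent tests, all null."""
    return 1 - (1 - p_per_test) ** n_tests

n_tests = 1000
expected_false_pos = 0.005 * n_tests           # expected false positives at p = 0.005
per_test = bonferroni_threshold(0.05, n_tests)
print(expected_false_pos, per_test)
print(round(prob_any_false_positive(0.005, n_tests), 4),
      round(prob_any_false_positive(per_test, n_tests), 4))
```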
44
Power and Type I/II errors
• A Type I error occurs when we declare a test to be significant when the null is true (a false positive)
• The power of a statistical test (a function of the sample size and the true parameters) is the probability of declaring a test to be significant when the null is false.
– A Type II error occurs when we fail to declare a test significant when it is not from the null (i.e., a false negative)
45
FDR, the false discovery rate
• p is the probability of declaring a test under the null to be significant (the false-positive rate)
• When many tests are expected to be significant (e.g., looking for differences in expression over a large number of genes), a more appropriate measure is the false discovery rate (or FDR), the fraction of false positives among all tests declared to be significant.
– Example: Suppose 1000 tests are run with a significance threshold of p = 0.005. Expect 5 false positives, but suppose that 30 significant tests are found. Here the FDR = 5/30 = 0.167.
– Hence, 16.7% of the positive tests are false positives
1
Lecture 4: Allelic Effects and Genetic
Variances
Bruce Walsh Notes Introduction to Plant Quantitative Genetics
Tucson. 8-10 Jan 2018
2
Quantitative Genetics
The analysis of traits whose variation is determined by both a
number of genes and environmental factors
Phenotype is highly uninformative as to underlying genotype
3
Complex (or Quantitative) trait • No (apparent) simple Mendelian basis for
variation in the trait • May be a single gene strongly influenced by
environmental factors • May be the result of a number of genes of
equal (or differing) effect • Most likely, a combination of both multiple
genes and environmental factors • Example: Blood pressure, cholesterol levels
– Known genetic and environmental risk factors • Molecular traits can also be quantitative traits
– mRNA level on a microarray analysis – Protein spot volume on a 2-D gel
4
Phenotypic distribution of a trait
5
Consider a specific locus influencing the trait
For this locus, mean phenotype = 0.15, while overall mean phenotype = 0
Hence, it is very hard to distinguish the QQ individuals from all others simply from their phenotypic values
Values for QQ individuals shaded in dark green
6
Basic model of Quantitative Genetics
Basic model: P = G + E
Phenotypic value -- we will occasionally also use z for this value
Genotypic value
Environmental value
G = average phenotypic value for that genotype if we are able to replicate it over the universe of environmental values, G = E[P]
Hence, genotypic values are functions of the environments experienced.
7
Basic model of Quantitative Genetics Basic model: P = G + E
G = average phenotypic value for that genotype if we are able to replicate it over the universe of environmental values, G = E[P]
G x E interaction --- The performance of a particular genotype in a particular environment differs from the sum of the average performance of that genotype over all environments and the average performance of that environment over all genotypes. Basic model now becomes P = G + E + GE
G = average value of an inbred line over a series of environments
8
East (1911) data on US maize
crosses
9 Each sample (P1, P2, F1) has same G, all variation in P is due to variation in E
Same G, Var(P) = Var(E)
10
All same G, hence Var(P) = Var(E)
Variation in G Var(P) = Var(G) + Var(E)
Var(F2) > Var(F1) due to Variation in G
Johannsen (1903) bean data
• Johannsen had a series of fully inbred (= pure) lines.
• There was a consistent between-line difference in the mean bean size
– Differences in G across lines
• However, within a given line, size of parental seed was independent of size of offspring seed
– No variation in G within a line
11
12
13
Goals of Quantitative Genetics • Partition total trait variation into genetic (nature) vs.
environmental (nurture) components • Predict resemblance between relatives
– If a sib has a disease/trait, what are your odds? – Selection response – Change in mean under inbreeding, outcrossing, assortative
mating • Find the underlying loci contributing to genetic
variation – QTL -- quantitative trait loci
• Deduce molecular basis for genetic trait variation • eQTLs -- expression QTLs, loci with a quantitative
influence on gene expression – e.g., QTLs influencing mRNA abundance on a microarray
14
The transmission of genotypes versus alleles
• With fully inbred lines, offspring have the same genotype as their parent, and hence the entire parental genotypic value G is passed along
– Hence, favorable interactions between alleles (such as with dominance) are not lost by randomization under random mating but rather passed along.
• When offspring are generated by crossing (or random mating), each parent contributes a single allele at each locus to its offspring, and hence only passes along a PART of its genotypic value
• This part is determined by the average effect of the allele
– Downside is that favorable interactions between alleles are NOT passed along to their offspring in a diploid (but, as we will see, are in an autotetraploid)
15
Genotypic values
It will prove very useful to decompose the genotypic value into the difference between homozygotes (2a) and a measure of dominance (d or k = d/a)
Genotype   aa      Aa      AA
Value      C - a   C + d   C + a
Note that the constant C is the average value of the two homozygotes.
If no dominance, d = 0, as the heterozygote value equals the average of the two homozygotes. Can also write d = ka, so that G(Aa) = C + ak
16
Computing a and d
Suppose a major locus influences plant height, with the following values
Genotype      aa   Aa   AA
Trait value   10   15   16
C = [G(AA) + G(aa)]/2 = (16 + 10)/2 = 13
a = [G(AA) - G(aa)]/2 = (16 - 10)/2 = 3
d = G(Aa) - [G(AA) + G(aa)]/2 = G(Aa) - C = 15 - 13 = 2
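The same calculation as a small helper, which also returns the dominance coefficient k = d/a. The function name is illustrative and not from the notes.

```python
def genotypic_parameters(g_aa, g_Aa, g_AA):
    """Return (C, a, d, k) from the three single-locus genotypic values."""
    c = (g_AA + g_aa) / 2   # midpoint of the two homozygotes
    a = (g_AA - g_aa) / 2   # half the difference between homozygotes
    d = g_Aa - c            # heterozygote deviation from the midpoint
    k = d / a               # dominance coefficient
    return c, a, d, k

c, a, d, k = genotypic_parameters(10, 15, 16)
print(c, a, d, round(k, 3))
```

Here k = 2/3, i.e., partial dominance of A.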
17
Population means: Random mating
Let p = freq(A), q = 1 - p = freq(a). Assuming random mating (Hardy-Weinberg frequencies),
Genotype    aa      Aa      AA
Value       C - a   C + d   C + a
Frequency   $q^2$   $2pq$   $p^2$
Mean = $q^2(C - a) + 2pq(C + d) + p^2(C + a)$
$\mu_{RM} = C + a(p^2 - q^2) + d(2pq)$
The term $a(p^2 - q^2)$ is the contribution from homozygotes; the term $d(2pq)$ is the contribution from heterozygotes
18
Population means: Inbred cross F2
Suppose two inbred lines are crossed. If A is fixed in one line and a in the other, then p = q = 1/2
Genotype    aa      Aa      AA
Value       C - a   C + d   C + a
Frequency   1/4     1/2     1/4
Mean = (1/4)(C - a) + (1/2)(C + d) + (1/4)(C + a)
$\mu_{F2} = C + d/2$
Note that C is the average of the two parental lines, so when d > 0, the F2 exceeds this (heterosis). Note also that the F1 exceeds this average by d, so only half of this is passed on to the F2.
19
Population means: RILs from an F2
A large number of F2 individuals are fully inbred, either by selfing for many generations or by generating doubled haploids. If p and q denote the F2 frequencies of A and a, what is the expected mean over the set of resulting RILs?
Genotype    aa      Aa      AA
Value       C - a   C + d   C + a
Frequency   q       0       p
$\mu_{RILs} = C + a(p - q)$
Note this is independent of the amount of dominance (d)
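The three population means (random mating, F2, and RILs) can be computed from the genotype tables directly. The sketch below uses the plant-height values C = 13, a = 3, d = 2 from the earlier example; function names are illustrative.

```python
def mean_random_mating(c, a, d, p):
    """Mean under Hardy-Weinberg frequencies, summed over the genotype table."""
    q = 1 - p
    return q * q * (c - a) + 2 * p * q * (c + d) + p * p * (c + a)

def mean_rils(c, a, p):
    """Mean of fully inbred lines: only the two homozygotes remain."""
    q = 1 - p
    return q * (c - a) + p * (c + a)

c, a, d = 13, 3, 2
f2_mean = mean_random_mating(c, a, d, 0.5)   # F2 of an inbred cross: p = q = 1/2
ril_mean = mean_rils(c, a, 0.5)
print(f2_mean, ril_mean)
```

With p = 1/2 the F2 mean exceeds the RIL mean by d/2, the portion of the heterosis that survives into the F2 but is lost on full inbreeding.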
20
The average effect of an allele
• The average effect αA of an allele A is defined as the difference between the mean value of offspring that receive that allele and the mean value of a random offspring.
– αA = mean(offspring value given parent transmits A) - mean(all offspring)
– Similar definition for αa.
• Note that while C, a and d (the genotypic parameters) do not change with allele frequency, αx is clearly a function of the frequencies of alleles with which allele x combines.
21
Random mating
Consider the average effect of allele A when a parent is randomly mated to another individual from its population
Suppose parent contributes A
Allele from other parent   Probability   Genotype   Value
A                          p             AA         C + a
a                          q             Aa         C + d
Mean(A transmitted) = p(C + a) + q(C + d) = C + pa + qd
αA = Mean(A transmitted) - µ = q[a + d(q - p)]
22
Random mating
Now suppose parent contributes a
Allele from other parent   Probability   Genotype   Value
A                          p             Aa         C + d
a                          q             aa         C - a
Mean(a transmitted) = p(C + d) + q(C - a) = C - qa + pd
αa = Mean(a transmitted) - µ = -p[a + d(q - p)]
23
α, the average effect of an allelic substitution
• α = αA - αa is the average effect of an allelic substitution, the change in mean trait value when an a allele in a random individual is replaced by an A allele
– α = a + d(q - p). Note that αA = qα and αa = -pα.
• E(αX) = pαA + qαa = pqα - qpα = 0
• The average effect of a random allele is zero, hence average effects are deviations from the mean
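The closed-form average effects can be checked against their definition (mean of offspring receiving the allele minus the population mean). A sketch using the plant-height values C = 13, a = 3, d = 2 and an arbitrarily chosen allele frequency p = 0.3:

```python
def average_effects(a, d, p):
    """Closed forms: alpha_A = q*alpha, alpha_a = -p*alpha, alpha = a + d(q - p)."""
    q = 1 - p
    alpha = a + d * (q - p)
    return q * alpha, -p * alpha, alpha

def average_effects_direct(c, a, d, p):
    """Definition: conditional offspring mean minus the random-mating mean."""
    q = 1 - p
    mu = q * q * (c - a) + 2 * p * q * (c + d) + p * p * (c + a)
    mean_A = p * (c + a) + q * (c + d)   # parent transmits A
    mean_a = p * (c + d) + q * (c - a)   # parent transmits a
    return mean_A - mu, mean_a - mu

c, a, d, p = 13, 3, 2, 0.3
alpha_A, alpha_a, alpha = average_effects(a, d, p)
direct_A, direct_a = average_effects_direct(c, a, d, p)
print(round(alpha_A, 4), round(alpha_a, 4), round(alpha, 4))
```

Rerunning with a different p shows that the average effects change with allele frequency even though a and d do not.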
24
Dominance deviations • Fisher (1918) decomposed the contribution
to the genotypic value from a single locus as Gij = µ + αi + αj + δij – Here, µ is the mean (a function of p) – αi are the average effects – Hence, µ + αi + αj is the predicted genotypic
value given the average effect (over all genotypes) of alleles i and j.
– The dominance deviation associated with genotype Gij is the difference between its true value and its value predicted from the sum of average effects (essentially a residual)
25
Fisher’s (1918) Decomposition of G
One of Fisher’s key insights was that the genotypic value consists of a fraction that can be passed from parent to offspring and a fraction that cannot.
In particular, under sexual reproduction, parents only pass along SINGLE ALLELES to their offspring
Consider the genotypic value Gij resulting from an AiAj individual
Gij = µG + αi + αj + δij
Here µG = Σ Gij Freq(AiAj) is the mean value, and αi is the average contribution to genotypic value for allele i
26
Since parents pass along single alleles to their offspring, the αi (the average effect of allele i) represent these contributions
The average effect for an allele is POPULATION-SPECIFIC, as it depends on the types and frequencies of alleles that it pairs with
Gij = µG + αi + αj + δij
The genotypic value predicted from the individual allelic effects is thus
$\hat{G}_{ij} = \mu_G + \alpha_i + \alpha_j$
27
Gij = µG + αi + αj + δij
The genotypic value predicted from the individual allelic effects is thus $\hat{G}_{ij} = \mu_G + \alpha_i + \alpha_j$
Dominance deviations --- the difference (for genotype AiAj) between the genotypic value predicted from the two single alleles and the actual genotypic value,
$G_{ij} - \hat{G}_{ij} = \delta_{ij}$
28
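[Figure: genotypic value plotted against N = number of copies of allele 2 (genotypes 11, 21, 22). The regression line passes through the predicted values µ + 2α1, µ + α1 + α2, and µ + 2α2, with slope α = α2 - α1. The deviations of the actual values G11, G21, and G22 from this line are the dominance deviations δ11, δ12, and δ22.]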
29
Average Effects and Additive Genetic Values
The α values are the average effects of an allele
A key concept is the Additive Genetic Value (A) of an individual
For a single locus, A(Gij) = αi + αj; over n loci, $A = \sum_{k=1}^{n} \left( \alpha_i^{(k)} + \alpha_j^{(k)} \right)$
$\alpha_i^{(k)}$ = effect of allele i at locus k
A is called the Breeding value or the Additive genetic value
30
Why all the fuss over A?
Suppose pollen parent has A = 10 and seed parent has A = -2 for plant height
Expected average offspring height is (10-2)/2 = 4 units above the population mean. Offspring A = average of parental A’s
KEY: parents only pass single alleles to their offspring. Hence, they only pass along the A part of their genotypic value G
31
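Genetic Variances
Writing the genotypic value as
Gij = µG + (αi + αj) + δij
The genetic variance can be written as
$\sigma^2(G) = \sigma^2(\alpha_i + \alpha_j) + \sigma^2(\delta_{ij})$
This follows since $\sigma^2(G) = \sigma^2(\mu_G + (\alpha_i + \alpha_j) + \delta_{ij}) = \sigma^2(\alpha_i + \alpha_j) + \sigma^2(\delta_{ij})$, as Cov(α,δ) = 0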
32
Genetic Variances
$\sigma^2_G = \sigma^2_A + \sigma^2_D$
$\sigma^2_A$ = Additive Genetic Variance (or simply additive variance)
$\sigma^2_D$ = Dominance Genetic Variance (or simply dominance variance)
Hence, total genetic variance = additive + dominance variances
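For a single diallelic locus under random mating, the decomposition can be carried out genotype by genotype and the variance partition checked numerically. The sketch below uses the plant-height values C = 13, a = 3, d = 2 with an arbitrarily chosen p = 0.3; the function name and dictionary layout are illustrative.

```python
def decompose(c, a, d, p):
    """Split each genotypic value into mean + breeding value + dominance deviation."""
    q = 1 - p
    freq = {"aa": q * q, "Aa": 2 * p * q, "AA": p * p}
    G = {"aa": c - a, "Aa": c + d, "AA": c + a}
    mu = sum(freq[g] * G[g] for g in G)

    alpha = a + d * (q - p)
    alpha_A, alpha_a = q * alpha, -p * alpha
    A = {"aa": 2 * alpha_a, "Aa": alpha_A + alpha_a, "AA": 2 * alpha_A}
    D = {g: G[g] - mu - A[g] for g in G}   # residual from the additive prediction

    # A and D both have mean zero, so variances are frequency-weighted squares.
    var_G = sum(freq[g] * (G[g] - mu) ** 2 for g in G)
    var_A = sum(freq[g] * A[g] ** 2 for g in G)
    var_D = sum(freq[g] * D[g] ** 2 for g in G)
    return var_G, var_A, var_D

var_G, var_A, var_D = decompose(13, 3, 2, 0.3)
print(round(var_G, 4), round(var_A, 4), round(var_D, 4))
```

The additive and dominance variances sum to the total genetic variance because breeding values and dominance deviations are uncorrelated.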
33
Key concepts (so far)
• αi = average effect of allele i
– Property of a single allele in a particular population (depends on genetic background)
• A = Additive Genetic Value
– A = sum (over all loci) of average effects
– Fraction of G that parents pass along to their offspring
– Property of an individual in a particular population
• Var(A) = additive genetic variance
– Variance in additive genetic values
– Property of a population
• Can estimate A or Var(A) without knowing any of the underlying genetic detail (forthcoming)
34
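One locus, 2 alleles:
Genotype   Q1Q1   Q1Q2     Q2Q2
Value      0      a(1+k)   2a
Since E[α] = 0, $\mathrm{Var}(\alpha) = E[(\alpha - \mu_\alpha)^2] = E[\alpha^2]$
Additive variance: $\sigma^2_A = 2 p_1 p_2\, a^2\, [1 + k(p_1 - p_2)]^2$
When dominance is present, the additive variance is an asymmetric function of allele frequencies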
35
Genotype   Q1Q1   Q1Q2     Q2Q2
Value      0      a(1+k)   2a
Dominance variance: $\sigma^2_D = (2 p_1 p_2\, a k)^2$
This is a symmetric function of allele frequencies
Can also be expressed in terms of d = ak: $\sigma^2_D = (2 p_1 p_2\, d)^2$
36
[Figure: additive variance, VA, plotted against allele frequency p with no dominance (k = 0)]
37
[Figure: VA and VD plotted against allele frequency p under complete dominance (k = 1)]
38
Epistasis
These components are defined to be uncorrelated, (or orthogonal), so that
39
Additive x Additive interactions -- αα, AA interactions between a single allele at one locus with a single allele at another
Additive x Dominance interactions -- αδ, AD interactions between an allele at one locus with the genotype at another, e.g. allele Ai and genotype Bkj
Dominance x dominance interaction --- δδ, DD the interaction between the dominance deviation at one locus with the dominance deviation at another.
40
Effects and Variance when using a testor
• A common design in plant breeding is to cross members from a population to a testor to generate a testcross.
– Testor can be either an inbred or an outcrossing population
– Often from a different heterotic group from the population being tested
– Often testor is an elite genotype
• The average effect of an allele in a testcross, its variance, and its additive (General combining ability, GCA) and interaction (Specific combining ability, SCA) effects all follow in analogous fashion to previous results for crosses within a population
41
The average effect of an allele in a testcross
• The concept of the average effect of an allele when crossed within its population is easily extended to the average effect of an allele when crossed to a testor.
– Called the testcross average effect.
• The average effect of allele x in this testcross, $\alpha_x^T$, is defined as the difference between the mean value of offspring getting this allele from the population versus the mean value of a random offspring from this cross
– Will turn out to be a function of the frequencies of alleles in both the tested and the testor population.
42
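Mean value for a testcross
Suppose the frequency of A is p in the population and pT in the testor (with q and qT similarly defined for a).
Frequency and genotypic value for each combination of parental-line allele (rows) and testor allele (columns):
                       Testor: A (pT)      Testor: a (qT)
Parental line: A (p)   ppT,  C + a         pqT,  C + d
Parental line: a (q)   qpT,  C + d         qqT,  C - a
Mean of cross = C + a(ppT - qqT) + d(pqT + qpT)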
43
Average testcross mean in a series of RILs
• Slide 19 gave an expression for the expected average performance from a series of RILs formed by crossing two populations.
• A similar expression exists for the average testcross performance for a series of RILs from a cross of A x B
– Mean = $(1/2)\,\mu_A^T + (1/2)\,\mu_B^T$, namely the average of the testcross means for A and B
– More generally (since lines can, by chance, have unequal contributions of alleles),
– Mean = $\pi_A\, \mu_A^T + \pi_B\, \mu_B^T$, where $\pi_A = 1 - \pi_B$ is the fraction of alleles from A in the sample of RILs
– Can use molecular markers to estimate the $\pi_x$ directly.
44
$\alpha_A^T$, testcross effect of allele A
Suppose parent contributes A
Allele from other parent   Probability   Genotype   Value
A                          pT            AA         C + a
a                          qT            Aa         C + d
Mean(A transmitted) = pT(C + a) + qT(C + d) = C + pTa + qTd
$\alpha_A^T$ = Mean(A transmitted) - µ = q[a + d(qT - pT)]
Likewise,
$\alpha_a^T$ = Mean(a transmitted) - µ = -p[a + d(qT - pT)]
45
$\alpha^T$, the average testcross effect of an allelic substitution
• $\alpha^T = \alpha_A^T - \alpha_a^T$ is the average testcross effect of an allelic substitution, the change in mean trait value when an a allele in a random testcrossed individual is replaced by an A allele
– $\alpha^T = a + d(q_T - p_T)$. Note that this is independent of the allele frequencies in the parental population, and depends ONLY on the testor allele frequencies.
• $\alpha_A^T = q\,\alpha^T$, $\alpha_a^T = -p\,\alpha^T$, and $E(\alpha_x^T) = 0$
46
Testcross variance
• Just as the additive genetic variance was the population variance in the sum of the average effects of an allele, the testcross variance is the variance in the average testcross effects of a random allele
– $\mathrm{Var}(A^T) = \mathrm{Var}(\alpha_x^T)$
– $\mathrm{Var}(\alpha_x^T) = p\,(\alpha_A^T)^2 + q\,(\alpha_a^T)^2$
– $= p\,(q[a + d(q_T - p_T)])^2 + q\,(-p[a + d(q_T - p_T)])^2 = pq\,[a + d(q_T - p_T)]^2$
– Hence, $\mathrm{Var}(\alpha_x^T) = pq\,[a + d(q_T - p_T)]^2$
47
GCA and SCA
• Consider a cross between individuals from population 1 and population 2
• Let $\mu_{1 \times 2}$ denote the average value for all of these crosses, and let Gij be the average genotypic value of an individual from a cross between individual (or line) i from population one and individual (or line) j from population two.
• Analogous to Fisher’s decomposition, we can write this in terms of two additive effects and one interaction effect.
48
$G_{ij} = \mu_{1 \times 2} + \alpha_i^{2} + \alpha_j^{1} + \delta_{ij}^{12}$
$\alpha_i^{2}$ is the testcross average effect for allele i (more generally an allele from individual i) when tested using population 2 as a testor, with $\alpha_j^{1}$ similarly defined for allele j (from pop 2) using population one as the testor
$\delta_{ij}^{12}$ is the interaction between allele i from population 1 and allele j from population 2 in the testcross of 1 and 2
The sum over all loci of the $\alpha_i^{2}$ values is the general combining ability (GCA) of line i when crossed to population 2 (note these are cross-specific)!
The sum of the δ is the specific combining ability (SCA)
49
$G_{ij} = \mu + GCA_i^{2} + GCA_j^{1} + SCA_{ij}^{12}$
The superscripts denoting the population in which the allele is being tested are often suppressed
The GCA is akin to the breeding value from one parent, but now it is the testcross value of that parent
The predicted mean of a particular cross is the sum of the two GCAs for those individuals/lines
As with average effects and dominance deviations, these are only defined with respect to a particular reference set of crosses (i.e., lines from Pop 1 x lines from Pop 2)
50
Within-population crosses vs. testors
                               Within-pop         Testor
Allelic effects                α                  αT
Additive transmitting factor   Breeding value A   GCA
Predicting offspring mean      A1/2 + A2/2        GCA1 + GCA2
Nonadditive component          Dominance value    SCA
Genetic variances              Var(A), Var(D)     Var(GCA), Var(SCA)
1
Lecture 5: Resemblance Between
Relatives Bruce Walsh Notes
Introduction to Plant Quantitative Genetics Tucson. 8-10 Jan 2018
2
Heritability • Central concept in quantitative genetics • Fraction of phenotypic variance due to
additive genetic values (Breeding values) – h2 = VA/VP
– This is called the narrow-sense heritability – Phenotypes (and hence VP) can be directly
measured – Breeding values (and hence VA) must be
estimated • Estimates of VA require known collections of
relatives
3
Broad-sense heritability
• Narrow-sense heritability h2 applies when outcrossing, – h2 = Var(A)/Var(P) – = the fraction of all trait variation due to variation
in breeding (additive genetic) values • Broad-sense heritability H2 applies when
selecting among a series of pure lines – H2 = Var(G)/Var(P) – = the fraction of all trait variation due to variation
in Genotypic values
4
Defining $H^2$ for Plant Populations
Plant breeders often do not measure individual plants (especially with pure lines), but instead measure a plot or a block of individuals.
This replication can result in inconsistent measures of $H^2$ even for otherwise identical populations.
Let $z_{ijkl}$ denote the value of the l-th replicate in plot k of genotype i in environment j. We can decompose this value as
$z_{ijkl} = G_i + E_j + GE_{ij} + p_{ijk} + e_{ijkl}$
Here $p_{ijk}$ is the effect of the k-th plot, and $e_{ijkl}$ are the deviations of individual plants within this plot
5
Suppose we replicate the genotype over e environments, with r plots (replicates) per environment, and n individuals per plot.
If we set our unit of measurement as the average over all plots, the phenotypic variance for the mean of line i becomes
$$\sigma^2(\bar{z}_i) = \sigma^2_G + \sigma^2_E + \frac{\sigma^2_{GE}}{e} + \frac{\sigma^2_p}{e\,r} + \frac{\sigma^2_e}{e\,r\,n}$$
Thus, VP, and $H^2 = V_G/V_P$, depend on our choice of e, r, and n
In order to compare broad-sense heritabilities we need to use a consistent design (same values of e, r, and n)
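The dependence of line-mean heritability on the design can be illustrated with the variance expression above: genotypic variance, plus environmental variance, plus GxE variance divided by e, plot variance divided by e*r, and within-plot variance divided by e*r*n. The variance-component values below are hypothetical, chosen only to show the effect of changing e, r, and n.

```python
def line_mean_heritability(var_g, var_env, var_ge, var_plot, var_within, e, r, n):
    """H^2 = V_G / V_P on a line-mean basis for e environments, r plots, n plants."""
    var_p = (var_g + var_env + var_ge / e
             + var_plot / (e * r) + var_within / (e * r * n))
    return var_g / var_p

# Hypothetical variance components (not from the notes)
components = dict(var_g=4.0, var_env=1.0, var_ge=2.0, var_plot=3.0, var_within=6.0)

h2_single = line_mean_heritability(**components, e=1, r=1, n=1)
h2_replicated = line_mean_heritability(**components, e=4, r=2, n=10)
print(round(h2_single, 3), round(h2_replicated, 3))
```

The same genetic material yields quite different H² values under the two designs, which is why the design must be held fixed when comparing broad-sense heritabilities.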
6
Key observations • The amount of phenotypic resemblance
among relatives for the trait provides an indication of the amount of genetic variation for the trait.
• If trait variation has a significant genetic basis, the closer the relatives, the more similar their appearance
• The covariance between the phenotypic value of relatives measures the strength of this similarity, with larger Cov = more similarity
7
Genetic Covariance between relatives
Genetic covariances arise because two related individuals are more likely to share alleles than are two unrelated individuals.
Sharing alleles means having alleles that are identical by descent (IBD): both copies can be traced back to a single copy in a recent common ancestor.
[Figure: offspring of a Father × Mother cross, with panels showing sibs sharing no alleles IBD, one allele IBD, and both alleles IBD]
ANOVA: Analysis of variance
• Partitioning of trait variance into within- and among-group components
• Two key ANOVA identities
  – Total variance = between-group variance + within-group variance
    • Var(T) = Var(B) + Var(W)
  – Variance(between groups) = covariance(within groups)
  – Intraclass correlation, t = Var(B)/Var(T)
• The more similar individuals are within a group (higher within-group covariance), the larger their between-group differences (variance in the group means)
[Figure: values for four groups (1 to 4) under two situations]
Situation 1: Var(B) = 2.5, Var(W) = 0.2, Var(T) = 2.7, so t = 2.5/2.7 = 0.93
Situation 2: Var(B) = 0, Var(W) = 2.7, Var(T) = 2.7, so t = 0
Why cov(within) = variance(among)?
• Let z_ij denote the jth member of group i.
  – Here z_ij = u + g_i + e_ij
  – g_i is the group effect
  – e_ij the residual error
• Covariance within a group, Cov(z_ij, z_ik)
  – = Cov(u + g_i + e_ij, u + g_i + e_ik)
  – = Cov(g_i, g_i), as all other terms are uncorrelated
  – Cov(g_i, g_i) = Var(g) is the among-group variance
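The identity cov(within) = variance(among) can be checked numerically. The sketch below is a small simulation under assumed normal group effects and residuals, using the variances of the two situations shown earlier (2.5 and 0.2, then 0 and 2.7); the sample size and seed are arbitrary choices.

```python
import random


def within_group_covariance(var_g, var_e, n_groups=20000, seed=1):
    """Simulate pairs of members (z_i1, z_i2) of the same group under
    z_ij = u + g_i + e_ij (with u = 0) and return their sample covariance."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_groups):
        g = rng.gauss(0.0, var_g ** 0.5)        # shared group effect
        z1 = g + rng.gauss(0.0, var_e ** 0.5)   # first member
        z2 = g + rng.gauss(0.0, var_e ** 0.5)   # second member
        pairs.append((z1, z2))
    m1 = sum(p[0] for p in pairs) / n_groups
    m2 = sum(p[1] for p in pairs) / n_groups
    return sum((a - m1) * (b - m2) for a, b in pairs) / (n_groups - 1)


cov_situation_1 = within_group_covariance(var_g=2.5, var_e=0.2)
cov_situation_2 = within_group_covariance(var_g=0.0, var_e=2.7)

print(f"Situation 1: Cov(within) = {cov_situation_1:.2f} (Var(B) = 2.5)")
print(f"Situation 2: Cov(within) = {cov_situation_2:.2f} (Var(B) = 0)")
```

Up to sampling error, the covariance between members of the same group matches the among-group variance in both situations.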
Resemblance between relatives and variance components
• The phenotypic covariance between relatives can be expressed in terms of genetic variance components
  – Cov(z_x, z_y) = a_xy V_A + b_xy V_D
  – The weights a and b depend on the nature of the relatives x and y, and are measures of how often they are expected to share alleles identical by descent
  – These are critical in predicting selection response
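Parent-offspring genetic covariance
Cov(G_p, G_o): parents and offspring share EXACTLY one allele IBD.
Denote this common allele by A1
$$G_p = A_p + D_p = \alpha_1 + \alpha_x + D_{1x}$$
$$G_o = A_o + D_o = \alpha_1 + \alpha_y + D_{1y}$$
Here α_1 is the effect of the IBD allele, while α_x and α_y are the effects of the non-IBD alleles.
Only the shared allele contributes to the covariance, so Cov(G_p, G_o) = Var(α_1) = Var(A)/2, since Var(A) = 2Var(α).
Hence, relatives sharing one allele IBD have a genetic covariance of Var(A)/2.
The resulting parent-offspring genetic covariance becomes Cov(G_p, G_o) = Var(A)/2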
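Half-sibs
[Figure: half-sib pedigree with a common father, mothers 1 and 2, and offspring o1 and o2]
Each sib gets exactly one allele from the common father, and different alleles from the different mothers.
• The half-sibs share one allele IBD: occurs with probability 1/2
• The half-sibs share no alleles IBD: occurs with probability 1/2
Hence, the genetic covariance of half-sibs is just (1/2)Var(A)/2 = Var(A)/4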
Full-sibs
[Figure: Father × Mother cross producing Sib 1 and Sib 2]
Each sib gets exactly one allele from each parent.
• Prob(allele from father IBD) = 1/2: given the paternal allele received by sib 1, there is probability 1/2 that sib 2 gets the same allele
• Prob(allele from father not IBD) = 1/2: given the paternal allele received by sib 1, there is probability 1/2 that sib 2 gets the other allele
• The same holds for the allele from the mother
Sibs share 0 alleles IBD: paternal allele not IBD [Prob = 1/2] and maternal allele not IBD [Prob = 1/2], so Prob(sibs share 0 alleles IBD) = 1/2 × 1/2 = 1/4
Sibs share 2 alleles IBD: paternal allele IBD [Prob = 1/2] and maternal allele IBD [Prob = 1/2], so Prob(sibs share 2 alleles IBD) = 1/2 × 1/2 = 1/4
Prob(share 1 allele IBD) = 1 − Pr(0) − Pr(2) = 1/2
Resulting genetic covariance between full-sibs

IBD alleles   Probability   Contribution
0             1/4           0
1             1/2           Var(A)/2
2             1/4           Var(A) + Var(D)

Cov(Full-sibs) = (1/2)Var(A)/2 + (1/4)[Var(A) + Var(D)] = Var(A)/2 + Var(D)/4
Genetic Covariances for General Relatives
Let r = (1/2)Prob(1 allele IBD) + Prob(2 alleles IBD)
Let u = Prob(both alleles IBD)
General genetic covariance between relatives: Cov(G) = r Var(A) + u Var(D)
When epistasis is present, additional terms appear: r² Var(AA) + r u Var(AD) + u² Var(DD) + r³ Var(AAA) + ⋯
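As a check on the definitions of r and u, the following Python sketch computes both weights from the IBD-sharing probabilities given in these notes for parent-offspring pairs, half-sibs and full-sibs. The function names are illustrative only, and exact fractions are used so the results can be compared directly with the slides.

```python
from fractions import Fraction


def covariance_weights(prob_one_ibd, prob_two_ibd):
    """Return (r, u): the weights on Var(A) and Var(D)."""
    r = Fraction(1, 2) * prob_one_ibd + prob_two_ibd
    u = prob_two_ibd
    return r, u


def genetic_covariance(r, u, var_A, var_D):
    """Cov(G) = r Var(A) + u Var(D), ignoring epistasis."""
    return r * var_A + u * var_D


# (Prob(1 allele IBD), Prob(2 alleles IBD)) for unrelated, non-inbred parents.
ibd_probabilities = {
    "parent-offspring": (Fraction(1), Fraction(0)),
    "half-sibs": (Fraction(1, 2), Fraction(0)),
    "full-sibs": (Fraction(1, 2), Fraction(1, 4)),
}

weights = {name: covariance_weights(*p) for name, p in ibd_probabilities.items()}
for name, (r, u) in weights.items():
    print(f"{name}: Cov(G) = {r} Var(A) + {u} Var(D)")
```

The output reproduces Var(A)/2 for parent-offspring, Var(A)/4 for half-sibs, and Var(A)/2 + Var(D)/4 for full-sibs.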
More general relationships
• To obtain the expected covariance for any set of relatives, we normally need only compute r and u for that set of relatives
• With general inbreeding, the covariance becomes more complex (three other terms arise in addition to V_A and V_D; not discussed here, see WL Chapter 11 for details)
• With crosses involving inbred and/or related parents, values for r and u are different from those presented above.
Coefficients of Coancestry
Suppose we pick a single allele each at random from two relatives. The probability that these are IBD is called Θ, the coefficient of coancestry.
Θxy denotes the coefficient for relatives x and y.
Consider an offspring z from a (hypothetical) cross of x and y. Θxy = fz, the inbreeding coefficient of z. Why? Because z gets one randomly chosen allele from each of x and y. The probability fz that both alleles in z are IBD (the probability of inbreeding) is thus just Θxy.
θ and the coefficient on V_A
• The coefficient on the additive variance for the relatives x and y is just 2θxy.
• To see this,
  – let AiAj denote the two alleles in x and AkAl those in y.
  – Cov(breeding values) = Pr(Ai ibd Ak)cov(αi, αk) + Pr(Ai ibd Al)cov(αi, αl) + Pr(Aj ibd Ak)cov(αj, αk) + Pr(Aj ibd Al)cov(αj, αl) = 4θxy Var(α)
  – Since Var(A) = 2Var(α), Cov = 2θxy Var(A)
Θxx: The coancestry of an individual with itself
Self x: what is the inbreeding coefficient of its offspring?
To compute Θxx, denote the two alleles in x by A1 and A2, and draw one allele twice (with replacement):

First draw   Second draw   Prob(IBD)
A1           A1            1
A1           A2            fx
A2           A1            fx
A2           A2            1

Each of the four combinations has probability 1/4.
Hence, for a non-inbred individual, Θxx = 2/4 = 1/2
If x is inbred, fx = prob A1 and A2 IBD, giving Θxx = (2 + 2fx)/4 = (1 + fx)/2
Example
Consider the following pedigree: A × B → E, C × D → F, and E × F → G.
Suppose A and D are fully inbred, and related, lines with θAD = 0.5. Further, B and C are unrelated and outcrossed individuals.

Individual            A    B    C    D
Fx                    1    0    0    1
θxx = (1 + Fx)/2      1    1/2  1/2  1
The Parent-offspring Coancestry
Let A1, An denote the two alleles in the offspring, where An is the allele from the nonfocal parent (NP), while A1, Ap are the two alleles in the focal parent (P). Draw one allele from the offspring and one from the parent:

Offspring draw   Parent draw   Prob(IBD)
A1               A1            1
A1               Ap            fp (A1 and Ap are IBD if the parent is inbred)
An               A1            ΘP,NP
An               Ap            ΘP,NP

Here ΘP,NP is the probability that the alleles from the two parents are IBD, i.e., that the offspring is inbred (ΘP,NP = fo).
General: ΘPO = (1 + fp + 2ΘP,NP)/4 = (1 + fp + 2fo)/4
For a non-inbred individual, ΘPO = 1/4
Example (pedigree: A × B → E, C × D → F, E × F → G)
From before: θAA = θDD = 1; θBB = θCC = 1/2; θAD = 1/2; θAB = θAC = θBC = θBD = θCD = 0
• Consider A - E (inbred parent - offspring): θAE = (1 + fA)/4 = (1 + 1)/4 = 1/2. Same value for θDF
• Consider B - E (outbred parent - offspring): θBE = (1 + fB)/4 = (1 + 0)/4 = 1/4. Same value for θCF
• Consider E - G (outbred parent - offspring), using θEF = 1/8 (derived below): θEG = (1 + fE + 2θEF)/4 = (1 + 0 + 2[1/8])/4 = 5/16. Same value for θFG
What about θEF? (pedigree: A × B → E, C × D → F)
From before: θAA = θDD = 1; θBB = θCC = 1/2; θAD = 1/2; θAB = θAC = θBC = θBD = θCD = 0
The randomly chosen allele from E has an equal chance of being from A or B. Likewise for F (from C or D).
Of these four possible combinations (A&C, A&D, B&C, B&D), only an allele from A and an allele from D have a chance of being IBD, which is θAD = 1/2.
Hence, θEF = θAD/4 = 1/8
Full sibs (x and y) from parents m and f
[Figure: path diagrams connecting x and y through each parent]
Unrelated, non-inbred parents:
  path through m: (1/2)(1/2)(1/2); path through f: (1/2)(1/2)(1/2)
  Θ = 1/8 + 1/8 = 1/4
Unrelated, inbred parents:
  path through m: [(1 + fm)/2](1/2)(1/2); path through f: [(1 + ff)/2](1/2)(1/2)
  Θ = (2 + fm + ff)/8
Full sibs (x and y) from parents m and f
Parents inbred & related: two additional paths, each contributing Θmf(1/2)(1/2) = Θmf/4, are added to Θ = (2 + fm + ff)/8.
This gives Θ = (2 + fm + ff + 4Θmf)/8 = (2 + fm + ff + 4fo)/8
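Full sibs (x and y) from parents m and f
Θxy = (2 + fm + ff + 4Θmf)/8
[Figure: pedigree with sf and df the sire and dam of f, sm and dm the sire and dam of m, and x and y the offspring of f and m]
ff = Θsf,df and fm = Θsm,dm
Putting all this together gives
Θxy = (2 + Θsm,dm + Θsf,df + 4Θmf)/8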
Example
Full sibs S1, S2 from the cross E × F (pedigree: A × B → E, C × D → F)
From before: θAA = θDD = 1; θBB = θCC = 1/2; θAD = 1/2; θEF = 1/8; θAB = θAC = θBC = θBD = θCD = 0
Θxy = (2 + ΘAB + ΘCD + 4ΘEF)/8
θS1S2 = (2 + 0 + 0 + 4[1/8])/8 = (4 + 1)/16 = 5/16
Half-sibs
[Figure: pedigree with A × B → E and A × C → F; A is the common parent]
• Using the same arguments as above, θEF = (θAA + θAB + θAC + θBC)/4 = ([1 + fA]/2 + θAB + θAC + θBC)/4
• Hence, if A, B and C are mutually unrelated, θEF = (1 + fA)/8
Computing θxy -- The Recursive Method
• There is a simple recursive method for generating the elements Aij = 2θij of a relationship matrix (used for BLUP selection). For ease of reading, we use the notation A(i,j) = Aij
  – The basic idea is that the founding individuals of the pedigree are assumed to be unrelated and not inbred (although relatedness and inbreeding among founders can also be accommodated). These founders are assigned values of A(i,i) = 1.
  – Likewise, any unknown parent of any future individual is assumed to be unrelated to all others in the pedigree and not inbred, and they are also assigned a value of A(i,i) = 1.
  – Let Si and Di denote the sire and dam (father and mother) of individual i. For this offspring, A(i,i) = 1 + A(Si, Di)/2
  – A(i,j) = A(j,i) = [A(j,Si) + A(j,Di)]/2 = [A(i,Sj) + A(i,Dj)]/2
  – The recursive (or tabular) method starts with the founding parents and then proceeds down the pedigree in a recursive fashion to fill out A for the desired pedigree.
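The tabular rules above translate into a few lines of Python. This is a minimal sketch rather than a production routine: it assumes the pedigree is listed so that parents precede their offspring, treats unknown parents (None) as unrelated and non-inbred, and uses a dictionary instead of a matrix. The pedigree encoded here is individuals 1 to 9 of the numbered example worked in these notes; individuals 10 and 11 are left out because their parents are not stated in the text.

```python
from fractions import Fraction


def relationship_matrix(pedigree):
    """pedigree: list of (individual, sire, dam), parents listed before offspring.
    Returns a dict A with A[(i, j)] = 2 * theta_ij."""
    A = {}
    done = []  # individuals already processed

    def rel(i, j):
        # An unknown parent is unrelated to everyone else in the pedigree.
        if i is None or j is None:
            return Fraction(0)
        return A[(i, j)]

    for ind, sire, dam in pedigree:
        for other in done:
            value = (rel(other, sire) + rel(other, dam)) / 2
            A[(ind, other)] = A[(other, ind)] = value
        A[(ind, ind)] = 1 + rel(sire, dam) / 2
        done.append(ind)
    return A


# Individuals 1-9 of the worked example (None = unknown parent).
pedigree = [
    (1, None, None), (2, None, None),
    (3, 1, None), (4, 1, None), (5, 1, None),
    (6, 2, 3), (7, 2, 4), (8, 5, None),
    (9, 7, 6),
]
A = relationship_matrix(pedigree)

print("A(6,7) =", A[(6, 7)])   # 5/16
print("A(6,8) =", A[(6, 8)])   # 1/16
print("A(9,9) =", A[(9, 9)])   # 37/32, i.e. 1 + 5/32
```

Dividing any off-diagonal entry by 2 gives the corresponding coefficient of coancestry, and A(i,i) − 1 gives the inbreeding coefficient of individual i.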
Example
[Figure: pedigree of individuals 1 to 11; 3, 4, 5 and 8 each have one unknown parent (only a single arrow to them)]
Founders are 1 & 2: A(1,1) = A(2,2) = 1, A(1,2) = 0
3: S3 = 1, D3 = unknown.
  A(3,3) = 1 + A(S3,D3)/2 = 1 + A(1,unk)/2 = 1
  A(1,3) = [A(1,S3) + A(1,D3)]/2 = [A(1,1) + A(1,unk)]/2 = 1/2
Note also that A(1,4) = A(1,5) = 1/2 and A(4,4) = A(5,5) = 1.
  A(3,4) = [A(3,S4) + A(3,D4)]/2 = [A(3,1) + A(3,unk)]/2 = (1/2 + 0)/2 = 1/4. Same for A(3,5) = 1/4.
2 is unrelated to 3, 4, 5, giving A(2,3) = A(2,4) = A(2,5) = 0.
6: S6 = 2, D6 = 3.
  A(6,6) = 1 + A(S6, D6)/2 = 1 + A(2,3)/2 = 1
  A(6,1) = [A(1, S6) + A(1, D6)]/2 = [A(1,2) + A(1,3)]/2 = [0 + 1/2]/2 = 1/4
  A(6,2) = [A(2, S6) + A(2, D6)]/2 = [A(2,2) + A(2,3)]/2 = [1 + 0]/2 = 1/2
  A(6,3) = [A(3, S6) + A(3, D6)]/2 = [A(3,2) + A(3,3)]/2 = [0 + 1]/2 = 1/2
  A(6,4) = [A(4, S6) + A(4, D6)]/2 = [A(4,2) + A(4,3)]/2 = [0 + 1/4]/2 = 1/8
  A(6,5) = [A(5, S6) + A(5, D6)]/2 = [A(5,2) + A(5,3)]/2 = (0 + 1/4)/2 = 1/8
7: S7 = 2, D7 = 4.
  A(7,7) = 1 + A(S7, D7)/2 = 1 + A(2,4)/2 = 1 + 0/2 = 1
  A(6,7) = [A(6, S7) + A(6, D7)]/2 = [A(6,2) + A(6,4)]/2 = (1/2 + 1/8)/2 = 5/16
8: S8 = 5, D8 = unk.
  A(8,8) = 1 + A(S8, D8)/2 = 1 + A(5,unk)/2 = 1
  A(6,8) = [A(6, S8) + A(6, D8)]/2 = [A(6,5) + A(6,unk)]/2 = (1/8)/2 = 1/16
9: S9 = 7, D9 = 6.
  A(9,9) = 1 + A(S9, D9)/2 = 1 + A(6,7)/2 = 1 + 5/32 = 1.156 <- inbred!
Actual relatedness versus expected values from pedigrees
Values for the coefficient of coancestry (θ) and the coefficient of fraternity (Δ) obtained from pedigrees are expected values. Due to random segregation of genes from parents, the actual value (or realization) can be different. For example, we expect 2θ to be ½ for full sibs. However, one pair of sibs may actually be more similar (say 0.6) and another less similar (say 0.35). On average, 2θ is ½ for pairs of full sibs, but if we knew the actual value of θ, we would have more information. With sufficiently dense genetic markers, we can estimate these relationships directly.
Genomic selection uses this extra information.
What about the coefficient of coancestry θ?
Locus               1     2     3     4     5     6     7     8     9     10
Indiv x:            00    00    10    10    00    10    11    00    11    00
Indiv y:            10    00    11    11    10    11    11    10    11    10
Locus-specific θ    0.5   1.0   0.5   0.5   0.5   0.5   1.0   0.5   1.0   0.5
Estimated θ is the average over all ten loci, = 0.65
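The locus-specific values in this marker example can be reproduced with a short Python sketch. It assumes, as the numbers on the slide suggest, that the locus-specific θ is the probability that one allele drawn at random from x matches one drawn at random from y; this is allele matching in state and is used here only to illustrate the arithmetic.

```python
def locus_match_probability(genotype_x, genotype_y):
    """Probability that a random allele from x equals a random allele from y."""
    matches = sum(a == b for a in genotype_x for b in genotype_y)
    return matches / (len(genotype_x) * len(genotype_y))


indiv_x = "00 00 10 10 00 10 11 00 11 00".split()
indiv_y = "10 00 11 11 10 11 11 10 11 10".split()

per_locus = [locus_match_probability(gx, gy) for gx, gy in zip(indiv_x, indiv_y)]
theta_hat = sum(per_locus) / len(per_locus)

print(per_locus)
print(f"estimated theta = {theta_hat:.2f}")
```

With many more markers the same average gives a realized, rather than pedigree-expected, measure of relationship.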
The coefficient of fraternity
• While (twice) the coefficient of coancestry gives the weight on the additive variance for two relatives, a related measure of IBD status among relatives gives the weight on the dominance variance
• The probability that the two alleles in individual x are IBD to the two alleles in individual y is denoted Δxy, and is called the coefficient of fraternity.
• This can be expressed as a function of the coefficients of coancestry for the parents (mx and fx) of x and the parents (my and fy) of y.
  – Δxy = θmxmy θfxfy + θmxfy θfxmy
The coefficient of fraternity (cont)
• x and y can have both alleles IBD if
  – The allele from the father (fx) of x and the father (fy) of y are IBD (probability θfxfy) AND the allele from the mother (mx) of x and the mother (my) of y are IBD (probability θmxmy), or θfxfy θmxmy
  – OR the allele from the mother (mx) of x and the father (fy) of y are IBD (probability θmxfy) AND the allele from the father (fx) of x and the mother (my) of y are IBD (probability θfxmy), or θmxfy θfxmy
  – Putting these together gives
    • Δxy = θmxmy θfxfy + θmxfy θfxmy
Δxy, The Coefficient of Fraternity
Δxy = Prob(both alleles in x & y IBD)
[Figure: pedigree with parents mx, fx of x and my, fy of y, with paths labeled θmxmy, θfxfy, θmxfy and θfxmy]
Δxy = θmxmy θfxfy + θmxfy θfxmy
Examples of Δxy: Full sibs
• Full sibs share the same mom and dad
  – mx = my = m, fx = fy = f
  – Δxy = θmxmy θfxfy + θmxfy θfxmy = θmm θff + θmf²
  – Δxy = (1 + fm)(1 + ff)/4 + θmf²
• If parents are unrelated, θmf = 0, giving
  – Δxy = (1 + fm)(1 + ff)/4
• If parents are unrelated and not inbred,
  – Δxy = 1/4
Examples of Δxy: Half sibs
• Paternal half sibs share the same dad, different moms
  – fx = fy = f; mx and my
  – Δxy = θmxmy θfxfy + θmxfy θfxmy = θmxmy θff + θmxf θmyf
  – Δxy = θmxmy (1 + ff)/2 + θmxf θmyf, where ff is the inbreeding coefficient of the common father
• If mothers are unrelated to each other and to the common father, θmxmy = θmxf = θmyf = 0, giving
  – Δxy = 0
When is Δ non-zero?
• Since Δxy = θmxmy θfxfy + θmxfy θfxmy
• A nonzero value for Δ requires either
  – That the fathers of both x and y are related AND the mothers of both x and y are related
  – OR that the father of x is related to the mother of y AND the mother of x is related to the father of y
Example
Full sibs S1, S2 from the cross E × F (pedigree: A × B → E, C × D → F)
From before: θAA = θDD = 1; θBB = θCC = 1/2; θAD = 1/2; θEF = 1/8; θAB = θAC = θBC = θBD = θCD = 0
What is Δ for the full sibs (S1 and S2)?
Δxy = θmxmy θfxfy + θmxfy θfxmy = θEE θFF + θEF²
With θEE = θFF = 1/2 (E and F are not inbred), this gives
Δxy = (1/2)(1/2) + (1/8)² = 1/4 + 1/64 = 17/64 = 0.266
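The fraternity calculation is easy to verify in code. The sketch below implements Δxy = θmxmy θfxfy + θmxfy θfxmy with exact fractions and applies it to the full-sib case; the helper names are illustrative only.

```python
from fractions import Fraction


def fraternity(theta_mxmy, theta_fxfy, theta_mxfy, theta_fxmy):
    """Coefficient of fraternity from the coancestries of the parents."""
    return theta_mxmy * theta_fxfy + theta_mxfy * theta_fxmy


def full_sib_fraternity(f_m, f_f, theta_mf):
    """Full sibs share both parents: theta_mm = (1 + f_m)/2, theta_ff = (1 + f_f)/2."""
    theta_mm = (1 + f_m) / 2
    theta_ff = (1 + f_f) / 2
    return fraternity(theta_mm, theta_ff, theta_mf, theta_mf)


# Unrelated, non-inbred parents.
delta_outbred = full_sib_fraternity(Fraction(0), Fraction(0), Fraction(0))

# Sibs S1, S2 from E x F, with f_E = f_F = 0 and theta_EF = 1/8.
delta_example = full_sib_fraternity(Fraction(0), Fraction(0), Fraction(1, 8))

print("outbred full sibs:", delta_outbred)   # 1/4
print("S1, S2 example:   ", delta_example)   # 17/64
```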
Δxy and the coefficient on V_D
• The coefficient on the dominance variance for the relatives x and y is just Δxy.
• To see this,
  – let AiAj denote the two alleles in x and AkAl those in y.
  – Suppose that alleles i and k come from the mothers of these two relatives and alleles j and l from their fathers.
  – Cov(dominance values) = Pr(Ai ibd Ak; Aj ibd Al) cov(δij, δkl) + Pr(Ai ibd Al; Aj ibd Ak) cov(δij, δkl)
  – = (θfxfy θmxmy + θmxfy θfxmy) Var(D) = Δxy Var(D)
General Resemblance between relatives
Cov(Gx, Gy) = 2θxy Var(A) + Δxy Var(D) + (2θxy)² Var(AA) + (2θxy)Δxy Var(AD) + Δxy² Var(DD) + …
Example
Full sibs S1, S2 from the cross E × F (pedigree: A × B → E, C × D → F)
We found for full sibs S1, S2 that θ = 5/16, hence 2θ = 5/8; Δ = 17/64
The expected genetic covariance between these sibs is
(5/8)Var(A) + (17/64)Var(D) + (5/8)²Var(AA) + (5/8)(17/64)Var(AD) + (17/64)²Var(DD) + …
Covariance among selfed lines
• A common situation in plant breeding is that two inbred lines are crossed
  – the frequency of any segregating allele in the F1/F2 is 1/2
  – Starting with the F1, a series of lines is formed by selfing, S_k = F_(k+2).
  – The covariance among the various S_k lines is important in the response to selection in selfed lines
  – In particular, we want the covariance between an S_j and an S_k line whose last common ancestor was the S_i, with i < j < k
Autotetraploids
• Peanut, potato, alfalfa and soybeans are all examples of crops with at least some autotetraploid lines
• With autotetraploids, there are four alleles per locus, with a parent passing along two alleles to an offspring
• As a result, a parent can pass along the dominance contribution in G to an offspring
• Further, now there are four variance components associated with each locus
Genetic variances for autotetraploids
• G = A + D + T + Q
  – A (additive) and D (dominance, or digenic effects) as with diploids
  – T (trigenic effects) are the three-way interactions among alleles at a locus
  – Q (quadrigenic effects) are the four-way interactions at a locus
• Total genetic variance becomes
  – VG = VA + VD + VT + VQ
Resemblance between autotetraploid relatives

Relatives          VA    VD    VT    VQ
Half-sibs          1/4   1/36
Full-sibs          1/2   2/9   1/12  1/36
Parent-offspring   1/2   1/6

Assumes unrelated, non-inbred parents
Module 1 Lecture 6 1
Lecture 6 Heritability and Field Design
Lucia Gutierrez lecture notes
Tucson Plant Breeding Institute
January 2018 Tucson, Arizona
1
Module 1: Lecture 6
Selection Response
[Figure: phenotypic distributions of the parental and progeny generations, marking the truncation point c, the means μ0, μS and μ1, the selection differential S and the response R]
μ0 = mean of the initial random-mating (R.M.) population
μS = mean of the selected individuals from the R.M. population
μ1 = mean of the progeny of the selected individuals
c = truncation point
S = selection differential (ΔX) = μS − μ0
R = response to selection (ΔY) = μ1 − μ0
$$\Delta Y = b_{y,x}\,\Delta X, \qquad R = b_{y,x}\,S, \qquad b_{y,x} = \frac{R}{S} = h^2$$
Heritability: “Expected proportion of selection differential to be achieved as a gain from selection” (Hanson, 1963).
Holland et al. 2010
Heritability
REALIZED HERITABILITY: The relation between the observed response to selection and the selection differential.
$$\hat{h}^2 = \frac{R}{S}$$
NARROW SENSE HERITABILITY: The fraction of all trait variation due to variation in breeding values (additive variance).
$$h^2 = \frac{V_A}{V_P}$$
BROAD SENSE HERITABILITY: The fraction of all trait variation due to variation in genotypic values (genetic variance).
$$H^2 = \frac{V_G}{V_P}$$
Holland et al. 2010
Heritability
$$h^2 = \frac{V_A}{V_P} \qquad H^2 = \frac{V_G}{V_P}$$
ADDITIVE (OR GENETIC) VARIANCE: We can estimate it from groups of related individuals:
* Covariance Parent – Offspring: ½ VA
* Covariance Half-sibs: ¼ VA
* Covariance Full-sibs: ½ VA + ¼ VD
TOTAL PHENOTYPIC VARIANCE: VP = VG + VE. Evaluate with proper designs.
Holland et al. 2010
Heritability
FUNCTION OF POPULATION
o Heritability is a function of both genetic and environmental variance, and is therefore a property of the population.
o Consider a single inbred line and a trait with Mendelian inheritance. How large is the genetic variance? Will the trait segregate? What is h²? Since there is no genetic variation within this population, h² = H² = 0. However, this does not mean that the trait is not genetically determined.
o Additionally, heritability is an unreliable predictor of future response to selection, because the population changes while selection is being conducted.
Holland et al. 2010
Heritability
ENVIRONMENTAL VARIANCE
Since h² is also a function of environmental variance, and decreasing environmental variance increases h², controlled conditions would be optimal for identifying superior genotypes (predicting breeding values). However, caution should be exercised because GxE is important for many traits and therefore selecting in a non-targeted environment could be detrimental.
Holland et al. 2010
Parent-Offspring Regression
$$z_{oi} = \mu + b_{o|p}\,(z_{pi} - \mu) + e_i$$
z_oi = phenotype of the i-th offspring
μ = population mean
b_o|p = regression slope of offspring on parent
z_pi = phenotype of the parent of the i-th offspring
e_i = deviations
[Figure: scatterplot of offspring phenotype (y-axis) against parent phenotype (x-axis), both axes running from -20 to 20]
LW 17
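Parent-Offspring Regression
Regression of offspring on one parent (one offspring or the mean of multiple offspring):
$$E(\hat{b}_{o|p}) = \frac{\sigma(z_o, z_p)}{\sigma^2(z_p)} \cong \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_{AA} + \sigma(E_o, E_p)}{\sigma^2_z}$$
Regression of offspring on one parent, no environmental correlation between parent and offspring:
$$E(\hat{b}_{o|p}) = \frac{\sigma(z_o, z_p)}{\sigma^2(z_p)} \cong \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_{AA}}{\sigma^2_z} \cong \tfrac{1}{2}h^2, \qquad h^2 \cong 2\,b_{o|p}$$
Regression of offspring on the midparent, no environmental correlation between parents and offspring:
$$E(\hat{b}_{o|\bar{p}}) = \frac{\sigma\!\left(z_o, \tfrac{1}{2}z_{P1} + \tfrac{1}{2}z_{P2}\right)}{\sigma^2\!\left(\tfrac{1}{2}z_{P1} + \tfrac{1}{2}z_{P2}\right)} \cong \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_{AA}}{\tfrac{1}{2}\sigma^2_z} \cong h^2, \qquad h^2 \cong b_{o|\bar{p}}$$
Regression of progeny on parent under inbreeding (S0:1 progeny on S0 parents), no environmental correlation:
[Equation: E(b̂_S0:1|S0) = σ(z_S0, z_S0:1)/σ²(z_S0), expanded in terms of additive, dominance and additive-by-additive variance components; see LW Chapter 17]
LW 17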
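A minimal Python sketch of heritability estimation from parent-offspring regression follows. It computes the least-squares slope of offspring on parent and converts it to h² using the relations h² ≅ 2b (single parent) and h² ≅ b (midparent). The phenotype values are made-up numbers for illustration, and the sketch assumes no environmental covariance between parents and offspring.

```python
def regression_slope(x, y):
    """Least-squares slope of y on x: Cov(x, y) / Var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


def h2_from_regression(parent_values, offspring_values, midparent=False):
    """h2 ~ b for midparent values, h2 ~ 2b for single-parent values."""
    b = regression_slope(parent_values, offspring_values)
    return b if midparent else 2 * b


# Hypothetical phenotypes (illustration only).
parents = [10.0, 12.0, 14.0, 16.0, 18.0]
offspring = [12.5, 13.0, 14.5, 14.0, 16.0]

slope = regression_slope(parents, offspring)
h2_single = h2_from_regression(parents, offspring)
h2_mid = h2_from_regression(parents, offspring, midparent=True)

print(f"slope = {slope:.2f}")
print(f"h2 if x holds single-parent values: {h2_single:.2f}")
print(f"h2 if x holds midparent values:     {h2_mid:.2f}")
```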
Mating Designs
FULL-SIB DESIGN: N full-sib families with n offspring each.
$$z_{ij} = \mu + f_i + w_{ij}$$
z_ij = phenotype of the j-th offspring of the i-th family
μ = population mean
f_i = effect of the i-th family
w_ij = residual error (segregation, dominance, environmental contribution); its variance is the within-family variance

SoV               df        SS                                 MS            EMS
Among-families    N−1       SS_f = n Σ_i (z̄_i. − z̄..)²        SS_f/df(f)    σ²_w(FS) + nσ²_f
Within-families   N(n−1)    SS_w = Σ_i,j (z_ij − z̄_i.)²       SS_w/df(w)    σ²_w(FS)

LW 18
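Mating Designs
FULL-SIB DESIGN: N full-sib families with n offspring each.
$$z_{ij} = \mu + f_i + w_{ij}$$
The covariance between two full sibs j and j' of family i equals the among-family variance:
$$\begin{aligned}
\mathrm{Cov}(FS) &= \sigma(z_{ij}, z_{ij'}) \\
&= \sigma(\mu + f_i + w_{ij},\; \mu + f_i + w_{ij'}) \\
&= \sigma(f_i, f_i) + \sigma(f_i, w_{ij'}) + \sigma(w_{ij}, f_i) + \sigma(w_{ij}, w_{ij'}) \\
&= \sigma^2_f
\end{aligned}$$
$$\sigma^2_f = \tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}$$
Since σ²_P = σ²_f + σ²_w(FS),
$$\begin{aligned}
\sigma^2_{w(FS)} &= \sigma^2_P - \left(\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}\right) \\
&= \sigma^2_A + \sigma^2_D + \sigma^2_E - \left(\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}\right) \\
&= \tfrac{1}{2}\sigma^2_A + \tfrac{3}{4}\sigma^2_D + \sigma^2_E - \sigma^2_{Ec}
\end{aligned}$$
LW 18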
Mating Designs
FULL-SIB DESIGN: N full-sib families with n offspring each.
With MS_f and MS_w the among- and within-family mean squares from the ANOVA of z_ij = μ + f_i + w_ij:
$$\mathrm{Var}(f) = \frac{MS_f - MS_w}{n}, \qquad \mathrm{Var}(w) = MS_w, \qquad \mathrm{Var}(z) = \mathrm{Var}(f) + \mathrm{Var}(w)$$
$$t_{FS} = \frac{\mathrm{Var}(f)}{\mathrm{Var}(z)} = \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}}{\sigma^2_z} = \tfrac{1}{2}h^2 + \frac{\tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}}{\sigma^2_z}$$
$$h^2 \cong 2\,t_{FS}$$
$$SE(h^2) \cong 2\,(1 - t_{FS})\,[1 + (n-1)\,t_{FS}]\,\sqrt{\frac{2}{N\,n\,(n-1)}}$$
LW 18
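The full-sib analysis can be sketched in Python as a balanced one-way ANOVA followed by the estimators above. The family data are invented numbers chosen so the arithmetic is easy to follow, and the function name is illustrative rather than part of any package.

```python
from math import sqrt


def full_sib_analysis(families):
    """Balanced one-way ANOVA on N full-sib families of n offspring each."""
    N = len(families)
    n = len(families[0])
    family_means = [sum(f) / n for f in families]
    grand_mean = sum(family_means) / N

    ss_f = n * sum((m - grand_mean) ** 2 for m in family_means)
    ss_w = sum((z - m) ** 2 for f, m in zip(families, family_means) for z in f)
    ms_f = ss_f / (N - 1)
    ms_w = ss_w / (N * (n - 1))

    var_f = (ms_f - ms_w) / n          # among-family variance component
    var_w = ms_w                       # within-family variance
    t_fs = var_f / (var_f + var_w)     # intraclass correlation
    h2 = 2 * t_fs
    se_h2 = 2 * (1 - t_fs) * (1 + (n - 1) * t_fs) * sqrt(2 / (N * n * (n - 1)))
    return {"MS_f": ms_f, "MS_w": ms_w, "Var_f": var_f, "Var_w": var_w,
            "t_FS": t_fs, "h2": h2, "SE_h2": se_h2}


# Hypothetical data: 3 families, 4 offspring each (illustration only).
families = [[8, 10, 12, 14], [10, 12, 14, 16], [12, 14, 16, 18]]
result = full_sib_analysis(families)

for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

With only three families the standard error is large, which is the practical message of the SE formula: N matters far more than the precision of any single family mean. The same mean squares serve a half-sib design, where h² ≅ 4t instead of 2t.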
Mating Designs
HALF-SIB DESIGN: N half-sib families with n offspring each.
$$z_{ij} = \mu + f_i + w_{ij}$$
z_ij = phenotype of the j-th offspring of the i-th family
μ = population mean
f_i = effect of the i-th family
w_ij = residual error (segregation, dominance, environmental contribution); its variance is the within-family variance

SoV               df        SS                                 MS            EMS
Among-families    N−1       SS_f = n Σ_i (z̄_i. − z̄..)²        SS_f/df(f)    σ²_w(HS) + nσ²_f
Within-families   N(n−1)    SS_w = Σ_i,j (z_ij − z̄_i.)²       SS_w/df(w)    σ²_w(HS)

LW 18
Mating Designs
HALF-SIB DESIGN: N half-sib families with n offspring each.
With MS_f and MS_w the among- and within-family mean squares from the ANOVA of z_ij = μ + f_i + w_ij:
$$\mathrm{Var}(f) = \frac{MS_f - MS_w}{n}, \qquad \mathrm{Var}(w) = MS_w, \qquad \mathrm{Var}(z) = \mathrm{Var}(f) + \mathrm{Var}(w)$$
$$t_{HS} = \frac{\mathrm{Var}(f)}{\mathrm{Var}(z)} = \frac{\tfrac{1}{4}\sigma^2_A}{\sigma^2_z} = \tfrac{1}{4}h^2, \qquad h^2 \cong 4\,t_{HS}$$
LW 18
Mating Designs
NORTH CAROLINA DESIGN I: Each male (N sires) is mated to several unrelated females (M dams) to produce n offspring per dam.
$$z_{ijk} = \mu + s_i + d_{j(i)} + w_{ijk}$$
z_ijk = phenotype of the k-th offspring from the family of the i-th sire and j-th dam
μ = population mean
s_i = effect of the i-th sire
d_j(i) = effect of the j-th dam mated to the i-th sire
w_ijk = residual error (within-family variance deviations)
LW 18
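Mating Designs
NORTH CAROLINA DESIGN I: Each male (N sires) is mated to several unrelated females (M dams) to produce n offspring per dam.
$$z_{ijk} = \mu + s_i + d_{j(i)} + w_{ijk}$$

SoV           df        SS                                      MS            EMS
Sires         N−1       SS_s = Mn Σ_i (z̄_i.. − z̄...)²          SS_s/df(s)    σ²_w + nσ²_d + Mnσ²_s
Dams(Sire)    N(M−1)    SS_d = n Σ_i,j (z̄_ij. − z̄_i..)²        SS_d/df(d)    σ²_w + nσ²_d
Sibs(dams)    T−NM      SS_w = Σ_i,j,k (z_ijk − z̄_ij.)²        SS_w/df(w)    σ²_w

(T = total number of offspring)

$$\mathrm{Var}(s) = \frac{MS_s - MS_d}{Mn}, \qquad \mathrm{Var}(d) = \frac{MS_d - MS_w}{n}, \qquad \mathrm{Var}(w) = MS_w$$
$$\mathrm{Var}(z) = \mathrm{Var}(s) + \mathrm{Var}(d) + \mathrm{Var}(w)$$
$$t_{PHS} = \frac{\mathrm{Var}(s)}{\mathrm{Var}(z)} = \frac{\tfrac{1}{4}\sigma^2_A}{\sigma^2_z} = \tfrac{1}{4}h^2, \qquad h^2 \cong 4\,t_{PHS}$$
$$t_{FS} = \frac{\mathrm{Var}(s) + \mathrm{Var}(d)}{\mathrm{Var}(z)} = \frac{\tfrac{1}{2}\sigma^2_A + \tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}}{\sigma^2_z} = \tfrac{1}{2}h^2 + \frac{\tfrac{1}{4}\sigma^2_D + \sigma^2_{Ec}}{\sigma^2_z}$$
LW 18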
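The nested ANOVA for North Carolina Design I can be sketched in Python as follows. The sketch assumes a balanced design and uses a tiny invented data set (2 sires, 2 dams per sire, 2 offspring per dam) picked only so the sums of squares come out as round numbers; the resulting variance components are not realistic.

```python
def nc1_variance_components(data):
    """data[i][j] is the list of n offspring values for dam j nested in sire i.
    Returns the mean squares and variance components of the nested ANOVA."""
    N = len(data)            # sires
    M = len(data[0])         # dams per sire
    n = len(data[0][0])      # offspring per dam

    dam_means = [[sum(fam) / n for fam in sire] for sire in data]
    sire_means = [sum(means) / M for means in dam_means]
    grand_mean = sum(sire_means) / N

    ss_s = M * n * sum((m - grand_mean) ** 2 for m in sire_means)
    ss_d = n * sum((dm - sm) ** 2
                   for means, sm in zip(dam_means, sire_means) for dm in means)
    ss_w = sum((z - dm) ** 2
               for sire, means in zip(data, dam_means)
               for fam, dm in zip(sire, means) for z in fam)

    ms_s = ss_s / (N - 1)
    ms_d = ss_d / (N * (M - 1))
    ms_w = ss_w / (N * M * (n - 1))   # T - NM degrees of freedom with T = NMn

    var_s = (ms_s - ms_d) / (M * n)
    var_d = (ms_d - ms_w) / n
    var_w = ms_w
    var_z = var_s + var_d + var_w
    return {"MS_s": ms_s, "MS_d": ms_d, "MS_w": ms_w,
            "Var_s": var_s, "Var_d": var_d, "Var_w": var_w,
            "t_PHS": var_s / var_z, "t_FS": (var_s + var_d) / var_z}


# Toy data (illustration of the arithmetic only).
toy = [[[10, 12], [14, 16]],
       [[20, 22], [26, 28]]]
components = nc1_variance_components(toy)

for key, value in components.items():
    print(f"{key}: {value:.3f}")
```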
Mating Designs
NORTH CAROLINA DESIGN II: A group of sires (N_s sires) is mated to an independent group of dams (N_d dams) to produce n offspring per cross.
$$z_{ijk} = \mu + s_i + d_j + I_{ij} + w_{ijk}$$
z_ijk = phenotype of the k-th offspring from the family of the i-th sire and j-th dam
μ = population mean
s_i = effect of the i-th sire
d_j = effect of the j-th dam
I_ij = effect of the interaction between the i-th sire and the j-th dam
w_ijk = residual error (within-family variance deviations)
LW 20
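Mating Designs
NORTH CAROLINA DESIGN II: A group of sires (N_s sires) is mated to an independent group of dams (N_d dams) to produce n offspring per cross.
$$z_{ijk} = \mu + s_i + d_j + I_{ij} + w_{ijk}$$

SoV           df                SS                                               EMS
Sires         N_s−1             SS_s = nN_d Σ_i (z̄_i.. − z̄...)²                 σ²_w + nσ²_I + nN_dσ²_s
Dams          N_d−1             SS_d = nN_s Σ_j (z̄_.j. − z̄...)²                 σ²_w + nσ²_I + nN_sσ²_d
Interaction   (N_s−1)(N_d−1)    SS_I = n Σ_i,j (z̄_ij. − z̄_i.. − z̄_.j. + z̄...)²  σ²_w + nσ²_I
Sibs          N_sN_d(n−1)       SS_w = Σ_i,j,k (z_ijk − z̄_ij.)²                 σ²_w

Intraclass correlations:
$$t_{PHS} = \frac{\sigma^2_s}{\sigma^2_z} = \frac{\tfrac{1}{4}\sigma^2_A}{\sigma^2_z}, \qquad
t_{MHS} = \frac{\sigma^2_d}{\sigma^2_z} = \frac{\tfrac{1}{4}\sigma^2_A + \sigma^2_{Gm} + \sigma^2_{Ec}}{\sigma^2_z}, \qquad
t_I = \frac{\sigma^2_I}{\sigma^2_z} = \frac{\tfrac{1}{4}\sigma^2_D}{\sigma^2_z}$$
LW 20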
Mating Designs
DIALLELS: A group of individuals (N) is mated to the same set of individuals (N) to produce n offspring per cross.
[Figure: crossing grids for three types of diallel]
• Full diallel (all selfed and reciprocal crosses are made)
• Incomplete diallel: no selfed crosses
• Incomplete diallel: no selfed, no reciprocal crosses
$$z_{ijk} = \mu + g_i + g_j + s_{ij} + w_{ijk}$$
z_ijk = phenotype of the k-th offspring from the i-th and j-th parents
μ = population mean
g_i = general combining ability of the i-th parent
g_j = general combining ability of the j-th parent
s_ij = specific combining ability of the i-th and j-th parents
w_ijk = residual error (within-family variance deviations)
LW 20
Mating Designs
DIALLELS: A group of individuals (N) is mated to the same set of individuals (N) to produce n offspring per cross. Analysis for an incomplete diallel without selfed or reciprocal crosses.
$$z_{ijk} = \mu + g_i + g_j + s_{ij} + w_{ijk}$$

SoV     df                     SS                                                  EMS
GCA     N−1                    SS_GCA = [n(N−1)²/(N−2)] Σ_i (z̄_i.. − z̄...)²       σ²_w + nσ²_SCA + n(N−2)σ²_GCA
SCA     N(N−3)/2               SS_SCA = n Σ_i<j (z̄_ij. − z̄...)² − SS_GCA          σ²_w + nσ²_SCA
Sibs    (n−1)[N(N−1)/2−1]      SS_w = Σ_i<j,k (z_ijk − z̄_ij.)²                    σ²_w

Intraclass correlations:
$$t_{GCA} = \frac{\sigma^2_{GCA}}{\sigma^2_z} = \frac{\tfrac{1}{4}\sigma^2_A}{\sigma^2_z}, \qquad
t_{SCA} = \frac{\sigma^2_{SCA}}{\sigma^2_z} = \frac{\tfrac{1}{4}\sigma^2_D}{\sigma^2_z}$$
LW 20
Genotypic Means
GENOTYPIC MEANS:
$$z_{ijkl} = G_i + E_j + GE_{ij} + p_{ijk} + \varepsilon_{ijkl}$$
B 8
The environment includes non-genetic factors that affect the phenotype, and usually has a large influence on quantitative traits, affecting heritability and response to selection.
Micro-environment. Environment of a single plant. It needs to be controlled with experimental design.
Macro-environment. Environment associated with a location and time. GxE is the norm and not the exception in plants. Therefore defining the target environments is a crucial part of plant breeding, both for variance component estimation and for identifying superior genotypes.
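Micro-environment variation
HERITABILITY: For related individuals, heritability can be calculated as previously shown. The previous calculation assumes individual plants are measured, and heritability on an individual plant basis is estimated. However, because quantitative traits measured on individual plants have large non-genetic effects, heritability on a mean basis is higher.

Individual plant basis, n=1, r=1, e=1
$$\sigma^2_P = \sigma^2_F + \sigma^2_{FE} + \sigma^2_e + \sigma^2_w$$
$$\hat{h}^2 = \frac{4}{1+F_P}\,\frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_P} = \frac{4}{1+F_P}\,\frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_F + \hat{\sigma}^2_{FE} + \hat{\sigma}^2_e + \hat{\sigma}^2_w}$$

Plot basis, n=n, r=1, e=1
$$\sigma^2_P = \sigma^2_F + \sigma^2_{FE} + \sigma^2_e + \frac{\sigma^2_w}{n}$$
$$\hat{h}^2_F = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_P} = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_F + \hat{\sigma}^2_{FE} + \hat{\sigma}^2_e + \hat{\sigma}^2_w/n}$$

Plot basis, multiple env, n=n, r=r, e=e
$$\sigma^2_P = \sigma^2_F + \frac{\sigma^2_{FE}}{e} + \frac{\sigma^2_e}{re} + \frac{\sigma^2_w}{ern}$$
$$\hat{h}^2_F = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_P} = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_F + \hat{\sigma}^2_{FE}/e + \hat{\sigma}^2_e/(re) + \hat{\sigma}^2_w/(ern)}$$
Holland et al., 2010; B 6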
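The effect of replication on family-mean heritability can be illustrated with a few lines of Python. The sketch evaluates the family-mean expression ĥ²_F = σ²_F / (σ²_F + σ²_FE/e + σ²_e/(re) + σ²_w/(ern)); the variance components are hypothetical values, not estimates from Holland et al. (2010) or Bernardo.

```python
def family_mean_h2(var_F, var_FE, var_e, var_w, e=1, r=1, n=1):
    """Heritability on a family-mean basis for e environments,
    r replications per environment and n plants per plot."""
    var_P = var_F + var_FE / e + var_e / (r * e) + var_w / (e * r * n)
    return var_F / var_P


# Hypothetical variance components (illustration only).
comps = dict(var_F=1.0, var_FE=1.0, var_e=2.0, var_w=8.0)

h2_plot = family_mean_h2(**comps, e=1, r=1, n=10)
h2_multi = family_mean_h2(**comps, e=4, r=3, n=10)

print(f"plot basis (n=10, r=1, e=1):        {h2_plot:.3f}")
print(f"multi-environment (n=10, r=3, e=4): {h2_multi:.3f}")
```

Adding environments and replications shrinks every non-genetic term in the denominator, so the same family variance yields a much higher heritability of family means.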
Experimental Design and Analysis
So we need good experimental designs to test genotypes!
FISHER’S PRINCIPLES. Statistical theory for efficient estimation (i.e. unbiased and of minimum variance) of treatment mean differences is based on 3 principles:
– Randomization: random assignment of treatments to experimental units provides valid estimation of experimental error, unbiased comparisons of treatments, and justifies statistical inference.
– Replication allows estimation of experimental error variance, and decreases the variance of means (i.e. variance of a mean = variance of the observation divided by the number of replications).
– Local control is the grouping of homogeneous experimental units. Well-chosen blocks will decrease error variance.
Experimental Design and Analysis
CLASSIFICATION OF EXPERIMENTAL DESIGNS:
1. Experimental units.
   1. Homogeneous: Complete Randomized Design (CRD)
   2. Heterogeneous in one way: Randomized Complete Block Design (RCBD)
   3. Heterogeneous in more than one way: Latin squares or latinized designs.
2. Large number of treatments.
   1. Incomplete Block Designs (IBD or Alpha)
   2. Unreplicated experiments (or Federer)
3. Modeling (post-blocking, spatial analysis)
[Figure: example field layouts of four treatments under CRD, RCBD and IBD]
CRD: y_ij = μ + α_i + ε_ij
RCBD: y_ij = μ + α_i + β_j + ε_ij
IBD: y_ijk = μ + α_i + β_j + γ_k(j) + ε_ijk
Complete Randomized Design (CRD)
TREATMENT ASSIGNMENT: Each treatment is assigned at random to the experimental units.
• Experimental unit: one plot.
RICE EXAMPLE. YIELD COMPARISON OF 4 RICE CULTIVARS
Treatments: 4 cultivars
Replications: 4 homogeneous experimental plots per treatment
Experimental design: CRD
Dependent variable: Y = Grain yield (kg ha⁻¹)
[Figure: field map of 16 plots (numbered 1 to 16) with the four cultivars assigned completely at random]
EXPERIMENTAL PRINCIPLES
1. Randomization
2. Replication
3. Local control
Randomize in R with design.crd(trt, r) from the agricolae package, where trt is a vector of all treatments (genotypes) and r the number of reps:
> library(agricolae)
> design.crd(c("G1", "G2", "G3", "G4"), 4)
Complete Randomized Design (CRD)
$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
Y_ij = response of the i-th treatment on the j-th rep
μ = population mean
α_i = effect of the i-th treatment
ε_ij = experimental error (residual)
ASSUMPTIONS:
1. TO THE MODEL:
   • Is correct (in relation to the experimental units)
   • Is additive
2. EXPERIMENTAL ERRORS (for hypothesis testing):
   • Are random variables
   • ε_ij ~ N
   • E(ε_ij) = 0 for every i, j
   • V(ε_ij) = σ² for every i, j
   • Are independent
3. “BY DEFINITION” α_i = μ_i − μ
Analyze in R:
> anova(lm(Y ~ trt, data=my.data))
Complete Randomized Design (CRD)

        T1    T2    T3    T4
        47    50    57    54
        52    54    53    65
        62    67    69    74
        51    57    57    59
Mean    53    57    59    63     (grand mean = 58)

SoV      df        SS                                   MS                 F            p
Geno     (t−1)     SS_T = r Σ_i (ȳ_i. − ȳ..)²           SS_T/df(T)         MS_T/MS_E
Error    t(r−1)    SS_E = Σ_i Σ_j (y_ij − ȳ_i.)²        SS_E/df(E)
Total    tr−1      SS_TOT = Σ_i,j (y_ij − ȳ..)²         SS_TOT/df(TOT)

SoV      df    SS     MS      F       p
Geno     3     208    69.3    1.29    0.323
Error    12    646    53.8
Total    15    854
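The CRD analysis of variance for the rice data can be reproduced by hand in Python. The sketch below is a plain balanced one-way ANOVA that mirrors the sums of squares defined in the table; it is an illustration alongside the R call, not a replacement for it, and it does not compute the p-value.

```python
def crd_anova(groups):
    """Balanced one-way ANOVA; groups[i] holds the r observations of treatment i."""
    t = len(groups)
    r = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (t * r)
    means = [sum(g) / r for g in groups]

    ss_t = r * sum((m - grand_mean) ** 2 for m in means)
    ss_e = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    ms_t = ss_t / (t - 1)
    ms_e = ss_e / (t * (r - 1))
    return {"SS_T": ss_t, "SS_E": ss_e, "SS_TOT": ss_t + ss_e,
            "MS_T": ms_t, "MS_E": ms_e, "F": ms_t / ms_e}


# Rice yields from the table, one list per cultivar (T1 to T4).
rice = [
    [47, 52, 62, 51],
    [50, 54, 67, 57],
    [57, 53, 69, 57],
    [54, 65, 74, 59],
]
crd = crd_anova(rice)

for key, value in crd.items():
    print(f"{key}: {value:.2f}")
```

The computed error mean square is 646/12 = 53.8, which together with MS_T = 69.3 gives the F value of 1.29 shown in the table.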
Randomized Complete Block Design (RCBD)
TREATMENT ASSIGNMENT: Each treatment is assigned randomly to the experimental units within a block. Randomization in each block is independent.
• Experimental unit: one plot.
WHEAT EXAMPLE. PLANT HEIGHT OF 5 ADVANCED LINES OF WHEAT IN 5 BLOCKS
Treatment: 5 advanced wheat lines
Block: 5 different blocks
Experimental design: RCBD
Dependent variable: Y = cm
[Figure: field map of blocks B1 to B5, each containing the five lines in an independent random order]
EXPERIMENTAL PRINCIPLES
1. Randomization
2. Replication
3. Local control
Randomize in R with design.rcbd(trt, r) from the agricolae package, where trt is a vector of all treatments (genotypes) and r the number of blocks:
> library(agricolae)
> design.rcbd(c("G1", "G2", "G3", "G4", "G5"), 5)
Randomized Complete Block Design (RCBD)
$$Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$$
Y_ij = response of the i-th treatment on the j-th block
μ = population mean
α_i = effect of the i-th treatment
β_j = effect of the j-th block
ε_ij = experimental error (residual)
ASSUMPTIONS:
1. TO THE MODEL:
   • Is correct (in relation to the experimental units)
   • Is additive
   • There is NO treatment by block interaction
2. EXPERIMENTAL ERRORS (for hypothesis testing):
   • Are random variables
   • ε_ij ~ N
   • E(ε_ij) = 0 for every i, j
   • V(ε_ij) = σ² for every i, j
   • Are independent
3. “BY DEFINITION” α_i = μ_i − μ
Analyze in R:
> anova(lm(Y ~ trt + block, data=my.data))
Randomized Complete Block Design (RCBD)

        T1     T2     T3     T4     T5     Mean
B1      8      10     8      9      10     9
B2      7      9      8      8      9      8.2
B3      6      8      9      8      8      7.8
B4      6      7      9      8      7      7.4
B5      7      9      10     7      9      8.4
Mean    6.8    8.6    8.8    8.0    8.6    8.16

SoV      df            SS                                               MS            F
Block    (r−1)         SS_B = t Σ_j (ȳ_.j − ȳ..)²                       SS_B/df(B)
Geno     (t−1)         SS_T = r Σ_i (ȳ_i. − ȳ..)²                       SS_T/df(T)    MS_T/MS_E
Error    (t−1)(r−1)    SS_E = Σ_i Σ_j (y_ij − ȳ_i. − ȳ_.j + ȳ..)²       SS_E/df(E)
Total    tr−1          SS_TOT = Σ_i,j (y_ij − ȳ..)²

SoV      df    SS       MS       F       p
Block    4     7.36     1.84     2.77    0.0637
Geno     4     13.36    3.34     5.02    0.0081
Error    16    10.64    0.665
Total    24    31.36
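The RCBD table for the wheat data can be checked the same way. This Python sketch computes the block, treatment, error and total sums of squares directly from their definitions for a balanced design with one observation per treatment per block; as before, the p-values are left to the R call.

```python
def rcbd_anova(table):
    """table[j][i] = response of treatment i in block j (one value per cell)."""
    r = len(table)        # blocks
    t = len(table[0])     # treatments
    grand_mean = sum(sum(row) for row in table) / (r * t)
    block_means = [sum(row) / t for row in table]
    trt_means = [sum(row[i] for row in table) / r for i in range(t)]

    ss_b = t * sum((m - grand_mean) ** 2 for m in block_means)
    ss_t = r * sum((m - grand_mean) ** 2 for m in trt_means)
    ss_tot = sum((y - grand_mean) ** 2 for row in table for y in row)
    ss_e = ss_tot - ss_b - ss_t

    ms_b = ss_b / (r - 1)
    ms_t = ss_t / (t - 1)
    ms_e = ss_e / ((t - 1) * (r - 1))
    return {"SS_B": ss_b, "SS_T": ss_t, "SS_E": ss_e, "SS_TOT": ss_tot,
            "MS_E": ms_e, "F_block": ms_b / ms_e, "F_geno": ms_t / ms_e}


# Plant heights from the table: rows are blocks B1-B5, columns are lines T1-T5.
wheat = [
    [8, 10, 8, 9, 10],
    [7, 9, 8, 8, 9],
    [6, 8, 9, 8, 8],
    [6, 7, 9, 8, 7],
    [7, 9, 10, 7, 9],
]
rcbd = rcbd_anova(wheat)

for key, value in rcbd.items():
    print(f"{key}: {value:.3f}")
```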
Field Designs and Heritability
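$$y_{ijk} = \mu + E_j + \beta_{k(j)} + P_i + PE_{ij} + \varepsilon_{ijk}$$

SoV            df
Environment    (e-1)
Block(Env)     e(r-1)
Progeny        (n-1)
ProgenyXEnv    (n-1)(e-1)
Pooled error   (n-1)(r-1)e

EMS:
$$MS_{Progeny} = V_\varepsilon + rV_{PE} + reV_{Progeny}$$
$$MS_{PE} = V_\varepsilon + rV_{PE}$$
$$MS_{Error} = V_\varepsilon$$

FACTORIAL DESIGNS (HS, FS, RIL, DH, Clones)
$$\text{HS: } V_{Progeny} = \frac{1+F}{4} V_A$$
$$\text{FS: } V_{Progeny} = \frac{1+F}{2} V_A + \left(\frac{1+F}{2}\right)^2 V_D$$
$$\text{RIL: } V_{Progeny} = 2V_A$$

$$h^2_F = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_p} = \frac{\hat{\sigma}^2_F}{\hat{\sigma}^2_F + \dfrac{\hat{\sigma}^2_{FE}}{e} + \dfrac{\hat{\sigma}^2_e}{er} + \dfrac{\hat{\sigma}^2_w}{ern}}$$
(B Chapter 6)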
Number of treatments
HOW TO DEAL WITH A HIGH NUMBER OF TREATMENTS?
1. STRATIFICATION: Group genotypes with similar characteristics (maturity, color, family) and compare within groups. NO BETWEEN GROUP COMPARISONS.
2. PRODUCE HOMOGENEOUS EXPERIMENTAL UNITS: Make every effort to homogenize the experimental area (look for soil similarity and field conditions that reduce variation, choose seeds of similar vigor).
3. USE REPEATED CHECKS: You may use checks in a systematic way to control or model soil heterogeneity.
4. EXPERIMENTAL DESIGN WITH SPATIAL CONSIDERATIONS: Use experimental designs that accommodate a large number of treatments while controlling variability (i.e. alpha designs, unreplicated designs, etc.).
Other Considerations (Design)
1. PLOT SIZE:
a. Big enough to have plants growing normally (i.e. one-plant plots do not work in crops but work fine for trees).
b. Big enough to represent trait variation (i.e. you might need larger plots for yield than for maturity).
c. Not so big as to have considerable within-plot variation (i.e. increasing plot size increases within-plot variation).
d. Balance between the cost of increasing plot size and increasing the number of plots.
EXAMPLE. Mohammad et al. (2001) compared plot size and number of replications needed to detect differences in wheat. Bigger plots require fewer reps.
[Figure: plot size against the difference to be detected, with one curve per number of replications (r = 2, 3, 4, 5).]
Other Considerations (Design)
2. PLOT SHAPE:
a. Balance between plot border and plants within the plot (i.e. rectangular plots have more border than square ones).
b. Take gradients into account (i.e. with a unidirectional gradient such as fertility, long and narrow plots might be better).
EXAMPLE. Haapanen (1992) compared plot size and shape for height in pines.
Other Considerations (Design)
3. ROW vs. HILL PLOTS:
EXAMPLE. Tragoonrung et al. (1990) compared row and hill plots for different traits in barley. Hill plots are OK for most traits but not for yield.
Other Considerations (Design)
4. BORDER ROW. It is a good idea to include a row of plants as a border row to avoid having plots in the margins of the experiment performing differently due to lack of competition.
5. OPERATOR NOISE. If measurements will be taken by different people, it is a good idea to account for operator noise, for example by having each person measure different blocks.
6. EARLY TESTING. Only small amounts of seed are available in early generations, so replication is a challenge. It is important to take this into account when deciding which traits will be measured early.
7. OTHER DESIGNS AND SPATIAL CORRELATIONS. More efficient designs exist for testing large numbers of genotypes in fields that might not be completely homogeneous. Additionally, spatial corrections might improve the estimation of genotypic means.
Other Considerations (Estimation)
1. MIXED MODELS. Expected Mean Squares (EMS) should be interpreted carefully, especially when using mixed models. The EMS differ depending on whether factors (i.e. environments, genotypes, etc.) are defined as random or fixed effects:
a. E, G, GxE random:
$$EMS(G) = \sigma^2_e + n\sigma^2_I + nN\sigma^2_G$$
b. G, GxE random, E fixed:
$$EMS(G) = \sigma^2_e + nN\sigma^2_G$$
Heritability within env:
$$h^2 = \frac{\sigma^2_G + \sigma^2_I}{\sigma^2_G + \sigma^2_I + \sigma^2_e}$$
Heritability across env.:
$$h^2 = \frac{\sigma^2_G}{\sigma^2_G + \sigma^2_I + \sigma^2_E + \sigma^2_e}$$
(LW Chapter 22)
Other Considerations (Estimation)
2. NON-BALANCED DESIGNS. When designs are not balanced, whether by design or because of missing plots or missing data, MS calculations are not straightforward. There are four ways to estimate MS.
MIXED MODELS - INCOMPLETE BLOCKS – UNBALANCED
With unbalanced data, mixed models, or correlation among genotypes, the variance-component estimate of heritability is not related to the gain from selection. Using the concept of "effective error variance" (Cochran and Cox, 1957), Cullis et al. (2006) proposed a generalized measure of heritability:
$$h^2 = 1 - \frac{\bar{\upsilon}_{BLUP}}{2\sigma^2_G}$$
$\bar{\upsilon}_{BLUP}$: mean variance of a difference of two BLUPs
Piepho and Mohring, 2007
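As a small numerical illustration of the generalized heritability above (one minus the mean variance of a difference of two BLUPs divided by twice the genotypic variance), the sketch below evaluates the formula for made-up inputs. The function name and the example values are hypothetical; in practice both quantities would come from the fitted mixed model.

```python
def generalized_heritability(mean_var_blup_diff, var_g):
    """h2 = 1 - (mean variance of a difference of two BLUPs) / (2 * genotypic variance)."""
    if var_g <= 0:
        raise ValueError("genotypic variance must be positive")
    return 1.0 - mean_var_blup_diff / (2.0 * var_g)

# Hypothetical values: mean variance of a BLUP difference 0.4, genotypic variance 1.0.
h2 = generalized_heritability(0.4, 1.0)
print(round(h2, 3))
```

When the BLUP differences are estimated without error the ratio is zero and the value is 1; as the mean variance of a difference approaches twice the genotypic variance the value approaches 0.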
Module 1: Lecture 7 1
Lecture 7 QTL Mapping
Lucia Gutierrez lecture notes
Tucson Plant Breeding Institute
January 2018 Tucson, Arizona
Problem
QUANTITATIVE TRAIT: Phenotypic trait under polygenic control (coded by many genes, each usually with a "small" effect) and with environmental influence.
How To Select For Quantitative Traits?
1. Traditional Breeding
2. Marker Assisted Selection
3. Genomic Selection
Identification of chromosome regions that affect quantitative traits
[Figure: chromosome with molecular markers and gene evidence.]
Information needed
1. Molecular marker scores: High throughput panels, controlled conditions, repeatable, cheap, automatic scoring.
2. Genetic map: More standard methods, small population sizes, consensus maps? Needs some more development.
3. Phenotypes: Crucial part; poor phenotypes mean poor QTL mapping.
Outline
1. Linkage
2. Types of Populations
3. Map construction using linkage (overview)
4. QTL mapping using linkage
o QTL mapping: 1. Single Marker Analysis
o QTL mapping: 2. Interval Mapping
o QTL mapping: 3. Composite Interval Mapping
5. Other issues:
o Multiple Testing
o Missing Data
o Epistasis
o QTLxE
o Polyploids
6. QTL estimation
Linkage and Mendel’s Laws
LAW OF SEGREGATION (MENDEL’S FIRST LAW):
o Every individual carries two copies of a gene (alleles)
o Each parent passes only one of its copies to an offspring
o A1A1 or A2A2 parents produce only 'A1' or 'A2' gametes, respectively; heterozygotes produce 'A1' and 'A2' gametes in a 50:50 ratio.
LAW OF INDEPENDENT ASSORTMENT (SECOND LAW):
o Different genes segregate independently
o True in the absence of linkage
Wu, Ma, and Casella, 2010
Linkage and Recombination
Molecular marker: short DNA segment (or a single base in the case of SNPs).
Locus: point on a chromosome (loci in plural), i.e. markers 1 and 2.
Allele: gene variant. Yellow = maternal. Red = paternal.
[Figure: cross A1A1B1B1 x A2A2B2B2 giving the F1 A1A2B1B2, with Marker 1 (A) and Marker 2 (B) on the chromosome, and the four F1 gamete types labelled P, R, R, P.]
The F1 produces two types of gametes:
- Parental (P): same combination as in the original lines (A1B1 or A2B2), each with frequency $(1-r)/2$
- Recombinant (R): one allele from each parent (A1B2 or A2B1), each with frequency $r/2$
Define r as the recombination frequency: r = # recombinants / total
If there is linkage there are more parental than recombinant gametes:
– r closer to 0 = strong linkage
– r closer to 0.5 = independence
Wu, Ma, and Casella, 2010
Linkage and Recombination
• The higher the recombination between two loci, the greater the genetic distance between them.
• If the loci are independent, all four gametes are equally frequent, 0.25 each (maximum recombination = 0.5).
• Mapping function:
• Therefore recombination frequencies (under certain circumstances) can be used to determine genetic distances (among loci, among markers, or between markers and QTL).
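The slide does not say which mapping function is used, so the following is only an illustration with the two most common choices, Haldane (no interference) and Kosambi (partial interference). The function names are hypothetical; distances are returned in centiMorgans.

```python
import math

def haldane_distance_cm(r):
    """Haldane map distance (cM) from recombination frequency r, 0 <= r < 0.5."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)

def haldane_recombination(d_cm):
    """Recombination frequency from a Haldane map distance in cM."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def kosambi_distance_cm(r):
    """Kosambi map distance (cM) from recombination frequency r, 0 <= r < 0.5."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

def kosambi_recombination(d_cm):
    """Recombination frequency from a Kosambi map distance in cM."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)

for r in (0.01, 0.10, 0.30):
    print(r, round(haldane_distance_cm(r), 2), round(kosambi_distance_cm(r), 2))
```

For small r both functions give a distance close to 100r cM; they diverge as r approaches 0.5, where the distance grows without bound while r stays capped at 0.5.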
Wu, Ma, and Casella, 2010
[Figure: cross A1A1B1B1 x A2A2B2B2, F1 A1A2B1B2, and gamete types P, R, R, P with frequencies (1-r)/2, r/2, r/2, (1-r)/2.]
QTL Mapping
KEY IDEA:
If a molecular marker is "associated" with the phenotype (i.e. the mean trait value for individuals with marker state MM is different from the mean trait value of individuals with marker state mm), then the marker is linked to a QTL.
Populations
We need genetically diverse populations!
There are two options:
1. Design a population with known recombination events.
o Types of populations: RIL, DH, F2, BC, etc.
o Linkage mapping (also known as "Traditional QTL Mapping", "Bi-parental QTL Mapping", "Balanced population QTL Mapping", "QTL Mapping", etc.).
2. Use existing diverse populations.
o LD related to distance + OTHER causes.
o Need to account for other causes of LD.
o Association Mapping (also known as "Linkage Disequilibrium Mapping", "LD Mapping", "GPD Mapping", "GWAS", etc.).
Designed Populations – F2
F2 population
[Figure: P1 (A1) x P2 (A2), with the F1 selfed to give the F2.]
The F1 is selfed one time.
All 3 possible genotypes are present: A1A1, A1A2, and A2A2.
Short 'history' of recombination.
Allows additivity to be distinguished from dominance.
Designed Populations – BC
Back-cross population (BC)
[Figure: P1 (A1) x P2 (A2), with the F1 crossed back to P1.]
The F1 is "back" crossed to one of the parents.
The BC lines carry
o a full chromosome from the recurrent parent
o a chromosome with a mosaic of the two parents
Possible genotypes: A1A1 or A1A2 (A2A2 is not present).
Short 'history' of recombination.
Designed Populations - DH
Doubled-Haploid Population (DH):
[Figure: P1 (A1) x P2 (A2), F1, F1 gametes, DH lines.]
F1 gametes are doubled.
Completely homozygous individuals (only A1A1 and A2A2 genotypes are possible).
One generation of recombination.
Designed Populations - RIL
Recombinant Inbred Lines (RIL):
[Figure: P1 (A1) x P2 (A2), followed by repeated selfing.]
F2 individuals are selfed for several generations.
Heterozygosity is halved each generation.
More generations mean more recombination.
Designed Populations – MP
Multiparent or 4-way Cross Population (MP):
Map construction
STUDY LINKAGE AMONG MARKERS
[Figure: marker scores, with recombination events between markers highlighted.]
Markers 1, 5 and 10 are linked.
Markers 1 and 5 are more closely linked to each other than to marker 10.
Markers 2 and 7 are also linked.
[Figure: resulting genetic map.]
Mapping Traits (Qualitative):
Two-point analysis
[Figure: marker scores alongside a qualitative trait scored as R or S (Wu, Ma, and Casella, 2010).]
Mapping Traits (Quantitative):
[Figure: marker scores alongside quantitative phenotype values.]
How to test for an association?
Mapping Traits (Quantitative)
F2 Population
[Figure: histograms of phenotype frequency (x-axis 35 to 65) by marker genotype (A1A1, A1A2, A2A2) for two simulated cases: n=500, a=8, d=0, mu=50, sd=2 and n=500, a=0, d=0, mu=50, sd=2.]
Phenotypic distribution is a mixture distribution.
Mapping Traits (Quantitative)
[Figure: two simulated cases (n=500, a=8, d=0, mu=50, sd=2 and n=500, a=0, d=0, mu=50, sd=2), shown as phenotype histograms by marker genotype (A1A1, A1A2, A2A2) and as phenotype plotted against marker score (-1.0 to 1.0).]
How do we test for an association?
• Linear models: t-test, ANOVA (F-test), regression.
• Maximum Likelihood: LRT.
Mapping Traits (Quantitative)
F2 Population (cross MMQQ x mmqq, F1 MmQq)
Assume:
• a co-dominant marker (M)
• a QTL (Q)
• recombination frequency r between M and Q
Genotypic values:
• P + a for QQ
• P + d for Qq
• P - a for qq
F1 Gametes: Pr(MQ) = ½ (1-r) = Pr(mq); Pr(Mq) = ½ r = Pr(mQ)

Marker                   Conditional frequency
Genotype   Frequency     QQ         Qq           qq
MM         ¼             (1-r)^2    2r(1-r)      r^2
Mm         ½             r(1-r)     1-2r+2r^2    r(1-r)
mm         ¼             r^2        2r(1-r)      (1-r)^2
Value of F2 individuals  P + a      P + d        P - a

BUT THE QTL IS NOT ON TOP OF THE OBSERVED MARKER
So we use conditional probabilities of QTL genotypes:
$$\Pr(Q_k \mid M_j) = \frac{\Pr(Q_k M_j)}{\Pr(M_j)}$$
F2 Individuals:
$$\Pr(QQ \mid MM) = \frac{\Pr(QQMM)}{\Pr(MM)} = \frac{\tfrac{1}{2}(1-r) \cdot \tfrac{1}{2}(1-r)}{\tfrac{1}{4}} = (1-r)^2$$
Bernardo, 2010
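The conditional frequencies in the F2 table can be derived mechanically from the F1 gamete frequencies. The sketch below does that by enumerating all pairs of gametes, which is a convenient check on the closed-form entries. The function name is hypothetical; the gamete frequencies Pr(MQ) = Pr(mq) = ½(1-r) and Pr(Mq) = Pr(mQ) = ½r are the ones given on the slide.

```python
from itertools import product

def f2_conditional_qtl_probs(r):
    """Pr(QTL genotype | marker genotype) in an F2, by enumerating F1 gamete pairs."""
    gametes = {
        ("M", "Q"): 0.5 * (1 - r), ("m", "q"): 0.5 * (1 - r),  # parental
        ("M", "q"): 0.5 * r, ("m", "Q"): 0.5 * r,              # recombinant
    }
    joint = {}
    for (g1, p1), (g2, p2) in product(gametes.items(), repeat=2):
        marker = "".join(sorted(g1[0] + g2[0]))  # 'MM', 'Mm' or 'mm'
        qtl = "".join(sorted(g1[1] + g2[1]))     # 'QQ', 'Qq' or 'qq'
        joint[(marker, qtl)] = joint.get((marker, qtl), 0.0) + p1 * p2
    marginal = {}
    for (marker, _), p in joint.items():
        marginal[marker] = marginal.get(marker, 0.0) + p
    return {
        marker: {qtl: joint[(marker, qtl)] / marginal[marker] for qtl in ("QQ", "Qq", "qq")}
        for marker in ("MM", "Mm", "mm")
    }

probs = f2_conditional_qtl_probs(0.1)
for marker, row in probs.items():
    print(marker, {qtl: round(p, 4) for qtl, p in row.items()})
```

Each row sums to one, and with r = 0 the marker genotype determines the QTL genotype exactly.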
Mapping Traits (Quantitative)
F2 Population
Mean of individuals:
$$\overline{MM}_{F_2} = P + a(1-2r) + 2r(1-r)d$$
$$\overline{Mm}_{F_2} = P + d(1 - 2r + 2r^2)$$
$$\overline{mm}_{F_2} = P - a(1-2r) + 2r(1-r)d$$
Differences:
$$(\overline{MM} - \overline{mm})_{F_2} = 2a(1-2r)$$
$$\left(\overline{Mm} - \frac{\overline{MM} + \overline{mm}}{2}\right)_{F_2} = d(1-2r)^2$$
ASPECTS TO NOTE:
1. QTL effect and position are confounded (i.e. the same mean difference could be achieved with a tightly linked QTL of small effect as with a loosely linked QTL of large effect).
2. In an F2 it is possible to estimate additive and dominance effects.
3. Both a* and d* (the effects estimated at the marker) underestimate a and d, by a fraction that depends on the unknown r.
Bernardo, 2010
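The confounding of QTL effect and position can be made concrete with the F2 marker-class means, MM = P + a(1-2r) + 2r(1-r)d, Mm = P + d(1-2r+2r^2) and mm = P - a(1-2r) + 2r(1-r)d. The sketch below is illustrative only: the function names and the numerical values of P, a, d and r are hypothetical.

```python
def f2_marker_means(P, a, d, r):
    """Expected trait means of the MM, Mm and mm marker classes in an F2."""
    mean_MM = P + a * (1 - 2 * r) + 2 * r * (1 - r) * d
    mean_Mm = P + d * (1 - 2 * r + 2 * r ** 2)
    mean_mm = P - a * (1 - 2 * r) + 2 * r * (1 - r) * d
    return mean_MM, mean_Mm, mean_mm

def apparent_effects(P, a, d, r):
    """Additive (a*) and dominance (d*) effects as seen at the marker."""
    mean_MM, mean_Mm, mean_mm = f2_marker_means(P, a, d, r)
    a_star = (mean_MM - mean_mm) / 2
    d_star = mean_Mm - (mean_MM + mean_mm) / 2
    return a_star, d_star

# A QTL of effect 8 sitting on the marker and a QTL of effect 10 at r = 0.1
# give the same apparent additive effect at the marker.
print(apparent_effects(P=50, a=8, d=0, r=0.0))
print(apparent_effects(P=50, a=10, d=0, r=0.1))
```

The apparent effects shrink as a(1-2r) and d(1-2r)^2, so dominance is attenuated more strongly than additivity as the marker moves away from the QTL.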
Mapping Traits (Quantitative)
F2 POPULATION
F1 Gametes: Pr(MQ) = ½ (1-r) = Pr(mq); Pr(Mq) = ½ r = Pr(mQ)
F2 Individuals: MMQQ, MMQq, MMqq, MmQQ, MmQq, Mmqq, mmQQ, mmQq, mmqq
Mean of individuals:
$$\overline{MM}_{F_2} = P + a(1-2r) + 2r(1-r)d$$
$$\overline{Mm}_{F_2} = P + d(1 - 2r + 2r^2)$$
$$\overline{mm}_{F_2} = P - a(1-2r) + 2r(1-r)d$$
Differences:
$$(\overline{MM} - \overline{mm})_{F_2} = 2a(1-2r)$$
$$\left(\overline{Mm} - \frac{\overline{MM} + \overline{mm}}{2}\right)_{F_2} = d(1-2r)^2$$

BC POPULATION
F1 Gametes: Pr(MQ) = ½ (1-r) = Pr(mq); Pr(Mq) = ½ r = Pr(mQ)
P2 Gametes: Pr(mq) = 1
BC Individuals: MmQq, Mmqq, mmQq, mmqq
$$\Pr(Qq \mid Mm) = \frac{\Pr(QqMm)}{\Pr(Mm)} = \frac{\tfrac{1}{2}(1-r) \cdot 1}{\tfrac{1}{2}} = (1-r)$$
Mean of individuals:
$$\overline{Mm}_{BC} = P + d(1-r) - ar$$
$$\overline{mm}_{BC} = P + dr - a(1-r)$$
Differences:
$$(\overline{Mm} - \overline{mm})_{BC} = (a+d)(1-2r)$$
Bernardo, 2010
Mapping Traits (Quantitative)
DH POPULATION
F1 Gametes: Pr(MQ) = ½ (1-r) = Pr(mq); Pr(Mq) = ½ r = Pr(mQ)
DH Individuals: Pr(MMQQ) = ½ (1-r) = Pr(mmqq); Pr(MMqq) = ½ r = Pr(mmQQ)
$$\Pr(QQ \mid MM) = \frac{\Pr(QQMM)}{\Pr(MM)} = \frac{\tfrac{1}{2}(1-r)}{\tfrac{1}{2}} = (1-r)$$
Mean of individuals:
$$\overline{MM}_{DH} = P + a(1-2r)$$
$$\overline{mm}_{DH} = P - a(1-2r)$$
Differences:
$$(\overline{MM} - \overline{mm})_{DH} = 2a(1-2r)$$

RIL POPULATION
Mean of individuals:
$$\overline{MM}_{RIL} = P + a(1-2R)$$
$$\overline{mm}_{RIL} = P - a(1-2R)$$
Differences:
$$(\overline{MM} - \overline{mm})_{RIL} = 2a(1-2R)$$
$$R = \frac{2r}{1+2r} \text{ for RIL from selfing}$$
$$R = \frac{4r}{1+6r} \text{ for RIL from sib-mating}$$
Bernardo, 2010
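The effect of the extra rounds of meiosis in RIL can be seen by converting the single-meiosis recombination frequency r into the RIL value R (R = 2r/(1+2r) for selfing, R = 4r/(1+6r) for sib-mating) and comparing the expected marker contrast 2a(1-2R) with the DH contrast 2a(1-2r). The sketch below is illustrative; function names and the example values are hypothetical.

```python
def ril_recombination(r, mating="selfing"):
    """Proportion of recombinant RIL given the per-meiosis recombination frequency r."""
    if mating == "selfing":
        return 2 * r / (1 + 2 * r)
    if mating == "sib":
        return 4 * r / (1 + 6 * r)
    raise ValueError("mating must be 'selfing' or 'sib'")

def expected_contrast(a, recombination):
    """Expected MM - mm difference, 2a(1 - 2*recombination), for DH (use r) or RIL (use R)."""
    return 2 * a * (1 - 2 * recombination)

a, r = 8.0, 0.1  # hypothetical additive effect and recombination frequency
print("DH        ", round(expected_contrast(a, r), 3))
print("RIL (self)", round(expected_contrast(a, ril_recombination(r, "selfing")), 3))
print("RIL (sib) ", round(expected_contrast(a, ril_recombination(r, "sib")), 3))
```

Because R is larger than r for linked loci, the same marker-QTL pair shows a smaller contrast in RIL than in DH, which is the price paid for the finer map resolution.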
How to map QTL?
Steps for mapping QTL through LINKAGE:
1. Create a designed population.
2. Collect genotypic information on parents and offspring in the form of molecular marker scores.
3. Look for linkage between marker loci.
4. Construct a genetic map.
5. Detect quantitative trait loci (testing for association between a phenotypic trait and a marker).
o Qualitative trait: two-point linkage test.
o Quantitative trait: linear models (t-test, ANOVA, marker-regression) or maximum likelihood tests (LRT). Use single marker analysis, interval mapping or composite interval mapping.
QTL mapping: 1. Single Marker Analysis
SINGLE MARKER ANALYSIS (SMA), MARKER REGRESSION (MR):
IDEA: If there is a significant association between a molecular marker and a quantitative trait, then it is possible that a QTL exists close to that marker. One marker at a time is tested through a linear model (i.e. t-test, ANOVA or regression), or using the full density function for the mixture distribution.
WHEN TO USE IT? To look at the data roughly and to study missing data patterns. OK if you are not interested in estimating QTL position or QTL effects.
Linear Model:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
Maximum Likelihood:
$$L(\mu, \sigma) = \prod_i \sum_j p_{ij}\, \phi(y_i; \mu_j, \sigma^2)$$
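The linear-model version of single marker analysis is an ordinary regression of the phenotype on a marker score. The sketch below fits that regression for one marker with plain Python; the data are made up, the coding of the marker as -1, 0, 1 is an assumption, and the function name is hypothetical. It returns the intercept, the slope (the apparent additive effect at the marker) and the t statistic for the slope.

```python
import math

def marker_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1, t statistic of b1)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)
    return b0, b1, b1 / se_b1

# Hypothetical F2 data: marker coded -1 (A1A1), 0 (A1A2), 1 (A2A2).
marker = [-1, -1, 0, 0, 1, 1]
trait = [42, 44, 49, 51, 57, 59]
b0, b1, t_stat = marker_regression(marker, trait)
print(round(b0, 2), round(b1, 2), round(t_stat, 2))
```

In a genome scan the same fit is repeated at every marker, and the resulting p-values (or their -log10) form the profile plot.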
QTL mapping: 1. Single Marker Analysis
[Figure: linkage map with marker positions from 0.0 to 176.3, with phenotype histograms (frequency, x-axis 35 to 65) by marker genotype (A1A1, A1A2, A2A2) shown for individual markers (Broman and Sen, 2009).]
QTL mapping: 1. Single Marker Analysis
LINEAR MODELS: Use the difference in mean trait value for different marker genotypes to detect a QTL and to estimate its effects.
One-way ANOVA:
$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$$
$y_{ij}$: value of the trait in the j-th individual with marker genotype i
$\alpha_i$: effect of marker genotype i on trait value
Regression:
$$y_{ij} = \beta_0 + \beta_1 x_i + \varepsilon_{ij}$$
$y_{ij}$: value of the trait in the j-th individual with marker genotype i
$x_i$: marker genotype i
$\beta_1$: effect of marker genotype on trait value
p-values are used for profile plots.
QTL mapping: 1. Single Marker Analysis
[Figure: phenotype histograms (frequency, x-axis 35 to 65) by marker genotype (A1A1, A1A2, A2A2).]
MAXIMUM LIKELIHOOD METHODS: Use the entire distribution of the data. They are more powerful than linear models but not as flexible.
ML Function:
$$\ell(z \mid M_j) = \sum_{k=1}^{N} \varphi(z, \mu_{Q_k}, \sigma^2)\, \Pr(Q_k \mid M_j)$$
$\ell(z \mid M_j)$: trait value given the marker genotype
$\varphi(z, \mu_{Q_k}, \sigma^2)$: density function for a normal distribution of the trait value given the QTL genotype, with mean $\mu_{Q_k}$
$\Pr(Q_k \mid M_j)$: conditional probability of QTL genotype given marker genotype
Test:
$$LR = -2 \ln\left[\frac{\max \ell_r(z)}{\max \ell(z)}\right]$$
$$LOD(c) = -\log_{10}\left[\frac{\max \ell_r(z)}{\max \ell(z)}\right] \cong \frac{LR(c)}{4.61}$$
$\max \ell_r(z)$: maximum likelihood under no-QTL
$\max \ell(z)$: maximum likelihood under the full model
LOD scores are used for profile plots.
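The two test statistics on this slide differ only by a constant, because LR uses natural logarithms and the LOD score uses base-10 logarithms: LOD = LR / (2 ln 10), and 2 ln 10 is approximately 4.61. A small sketch, with hypothetical function names and made-up log-likelihood values:

```python
import math

def likelihood_ratio(loglik_reduced, loglik_full):
    """LR = -2 ln(L_reduced / L_full), computed from natural-log likelihoods."""
    return -2.0 * (loglik_reduced - loglik_full)

def lod_from_lr(lr):
    """LOD score equivalent to a likelihood-ratio statistic."""
    return lr / (2.0 * math.log(10.0))

# Hypothetical maximized log-likelihoods for the no-QTL and full models.
lr = likelihood_ratio(loglik_reduced=-1050.0, loglik_full=-1043.0)
print(round(lr, 2), round(lod_from_lr(lr), 2))
```

A LOD of 3 therefore corresponds to an LR of about 13.8.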
QTL mapping: 1. Single Marker Analysis
[Figure: profile plot of LOD score or –log(p-value) along the genome.]
Single Marker Analysis (+):
• Simple
• No specific software requirement
• No chromosome position requirement
• No estimation of QTL position
Single Marker Analysis (-):
• Single QTL model
• Confounding of QTL estimation and position
• Loss of power due to residual variance caused by other QTL
• n tests
QTL mapping: 1. Single Marker Analysis
Issue 1: Multiple Testing
A common practice is to use P<0.05 to decide about the significance of a test. But with a large number of tests the chance of having at least one false positive is almost 1.
OTHER OPTIONS
1. Bonferroni multiple-testing protection:
a. P = 0.05 / number of tests (this is very conservative)
b. P = 0.05 / effective number of tests, with the effective number of tests estimated from the marker data (Li and Ji, 2005).
2. Permutations (Broman and Sen, 2009)
3. False Discovery Rate (Benjamini and Hochberg, 1995)
Issue 2: Missing data problem
Can we say something about unobserved positions?
Yes, we can use information from neighbouring markers. If we know the recombination history of the population, we can calculate conditional probabilities of QTL genotypes:
$$\Pr(Q_k \mid M_j) = \frac{\Pr(Q_k M_j)}{\Pr(M_j)}$$
FILLING INFORMATION
o We only have information at specific positions (marker positions). We might be interested in making inferences in between markers.
o Marker information can be incomplete (missing values) or partially informative (dominant markers).
Broman and Sen, 2009
Issue 2: Missing data problem
1. QTL genotype probabilities can be computed at any point on the genome using observed markers (Markov chain methods, Jiang and Zeng, 1997).
2. Missing values in markers can be imputed (so no need to exclude genotypes with missing marker data).
3. The conditional QTL probabilities are at the core of QTL mapping methods (but that is taken care of by the software).
QTL mapping: 2. Interval Mapping
SIMPLE INTERVAL MAPPING (SIM):
IDEA: Information on adjacent markers can be used to improve estimations. You may either use the genotypic predictors (pseudo-markers) or the conditional probabilities directly to scan for significant associations one marker (or pseudo-marker) at a time, or use the full likelihood function for the interval.
WHEN TO USE IT? To fill in missing data or to enrich marker data information. OK if background QTL are not important.
Linear Model:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
Maximum Likelihood:
$$L(\mu, \sigma) = \prod_i \sum_j p_{ij}\, \phi(y_i; \mu_j, \sigma^2)$$
QTL mapping: 2. Interval Mapping
Uses contiguous marker information to improve the estimation of marker effects:
Marker 1 --- c1 --- QTL --- c2 --- Marker 2 (c12: recombination frequency between Marker 1 and Marker 2)
[Figure: linkage map with marker positions from 0.0 to 176.3 and the likelihood (L) function profile (Broman and Sen, 2009).]
Probability of the genotypes:
$$\Pr(M_1M_1\,QQ\,M_2M_2) = \left[\frac{(1-c_1)(1-c_2)}{2}\right]^2$$
$$\Pr(M_1M_1\,Qq\,M_2M_2) = 2\left[\frac{(1-c_1)(1-c_2)}{2}\right]\left[\frac{c_1c_2}{2}\right]$$
$$\Pr(M_1M_1\,qq\,M_2M_2) = \left[\frac{c_1c_2}{2}\right]^2$$
Conditional probability of the QTL:
$$\Pr(QQ \mid M_1M_1M_2M_2) = \frac{(1-c_1)^2(1-c_2)^2}{(1-c_{12})^2}$$
$$\Pr(Qq \mid M_1M_1M_2M_2) = \frac{2c_1c_2(1-c_1)(1-c_2)}{(1-c_{12})^2}$$
$$\Pr(qq \mid M_1M_1M_2M_2) = \frac{c_1^2c_2^2}{(1-c_{12})^2}$$
Maximum Likelihood Methods: Use the likelihood function and the conditional probabilities inside the interval defined by two markers to determine the most likely position of the QTL inside the interval.
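The conditional probabilities for a QTL flanked by two markers can be evaluated along the interval to see how the inference changes with position. The sketch below handles only the flanking genotype M1M1M2M2 in an F2 and assumes no crossover interference, so that 1 - c12 = (1-c1)(1-c2) + c1c2; that assumption and the function name are mine, not the slide's.

```python
def flanking_qtl_probs(c1, c2):
    """Pr(QQ), Pr(Qq), Pr(qq) given flanking marker genotype M1M1 M2M2 in an F2.

    c1, c2: recombination frequencies marker 1-QTL and QTL-marker 2.
    Assumes no interference: 1 - c12 = (1 - c1)(1 - c2) + c1 * c2.
    """
    no_rec = (1 - c1) * (1 - c2)   # gamete M1 Q M2 (no crossover in either segment)
    double_rec = c1 * c2           # gamete M1 q M2 (crossover in both segments)
    denom = (no_rec + double_rec) ** 2
    return {
        "QQ": no_rec ** 2 / denom,
        "Qq": 2 * no_rec * double_rec / denom,
        "qq": double_rec ** 2 / denom,
    }

# Hypothetical interval: QTL at c1 = 0.05 from marker 1 and c2 = 0.10 from marker 2.
probs_interval = flanking_qtl_probs(0.05, 0.10)
print({g: round(p, 4) for g, p in probs_interval.items()})
```

With tight flanking markers the QTL genotype is almost certainly QQ, which is why interval mapping loses little information relative to observing the QTL directly.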
QTL mapping: 2. Interval Mapping
[Figure: linkage map with marker positions from 0.0 to 176.3 and pseudo-markers (c.10, c.20, ..., c.170) placed between the observed markers (Broman and Sen, 2009).]
Haley-Knott Regression: Uses the conditional probabilities calculated inside the interval defined by two markers directly as pseudo-markers and performs a regression at each point.
Simple Interval Mapping (+):
• Evaluation at and between markers
• Estimation of QTL position
• No specialized software (only for conditional distribution calculations)
Simple Interval Mapping (-):
• Single QTL model
• Loss of power due to residual variance caused by other QTL
• n-1* tests
With high marker density it is very similar to marker regression
QTL mapping: 2. Interval Mapping
QTL mapping
We have fitted individual (virtual) markers to phenotypic responses: single-QTL models (Single Marker Analysis and Simple Interval Mapping). However, we expect a particular phenotypic trait to be caused by a number of QTL simultaneously.
When testing individual markers, tests for QTL will have diminished power because of QTL segregating at positions other than the evaluation position.
What strategy to follow to arrive at multi-QTL models?
Use markers previously identified as close to or coinciding with significant QTL/genes as covariables (cofactors) in a subsequent genome-wide scan.
QTL mapping: 3. Composite Interval
COMPOSITE INTERVAL MAPPING (CIM):
IDEA: On top of using contiguous marker information, use background loci to get a better estimation of QTL effects. MR and SIM provide biased estimates when multiple QTL are close to a marker, and have less power in general. The problem is how to select the cofactors.
WHEN TO USE IT? It is the preferred method because it has more power and decreases bias due to contiguous QTL.
QTL mapping: 3. Composite Interval
[Figure: linkage map with marker positions from 0.0 to 176.3, pseudo-markers (c.10, c.20, ..., c.170), and two cofactors (1 and 2) with their exclusion windows (Broman and Sen, 2009).]
Uses markers as cofactors to improve the estimation of genetic background interactions.
No cofactor is allowed within windows of a specific size, to avoid overfitting.
Conditional probabilities in between markers are still used to improve estimations.
Outside both windows:
$$y_i = \mu + M_i + C_1 + C_2 + \varepsilon_i$$
Inside window 1:
$$y_i = \mu + M_i + C_2 + \varepsilon_i$$
Inside window 2:
$$y_i = \mu + M_i + C_1 + \varepsilon_i$$
QTL mapping: 3. Composite Interval
[Figure: two profile plots, one with cofactors excluded when 50 cM or less apart and one with cofactors excluded when ONLY 5 cM or less apart.]
Be careful with interpretation. Sharp peaks are caused by the window size, not by a higher precision for QTL location.
Composite Interval Mapping (+):
• Evaluation at and between markers
• Estimation of QTL position
• Control of genetic background interactions
• Multiple QTL screened
Composite Interval Mapping (-):
• How to select marker-cofactors?
QTL mapping: 3. Composite Interval
Power and Repeatability:
1. Underestimation of the number of QTL
2. Over-estimation of effects
Beavis effect
Bernardo, 2010
QTL estimation
QTL LOCATION (point estimate)
o After having identified the significant genetic predictors, QTL are typically located at the positions of the maximum value of the test statistic in a chromosome region where all genetic predictors exceeded the threshold for significance.
o This final identification of QTL using LOD, t, F, Wald, deviance or –log10(p) values is not trivial due to the irregularity of many test statistic profiles.
QTL EFFECTS (point and interval estimates)
o Given that QTL and their locations were identified, point and interval estimates for QTL effects can be obtained by fitting a linear (mixed) model with all identified QTL.
o Some clean-up procedure may be necessary after genome scans to arrive at a final acceptable multi-QTL model.
o 95% CI (F2 and BC) = 530 / (number of individuals × fraction of phenotypic variance explained) (Darvasi and Soller, 1997).
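The Darvasi and Soller (1997) approximation is a one-line calculation. The sketch below evaluates 530 / (N × fraction of phenotypic variance explained) for a few hypothetical population sizes; the function name is hypothetical and the result is assumed here to be an interval width in cM.

```python
def darvasi_soller_ci(n_individuals, variance_explained):
    """Approximate 95% confidence interval for QTL position in F2 or BC populations."""
    if n_individuals <= 0 or not 0 < variance_explained <= 1:
        raise ValueError("need n > 0 and 0 < variance_explained <= 1")
    return 530.0 / (n_individuals * variance_explained)

# Hypothetical QTL explaining 10% of the phenotypic variance.
for n in (100, 200, 500):
    print(n, round(darvasi_soller_ci(n, 0.10), 1))
```

The interval shrinks in proportion to population size and to the variance explained, so small-effect QTL in small populations are located very imprecisely.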
QTL estimation
EXPLAINED PHENOTYPIC VARIATION BY QTL
o Extra sums of squares / partial r2.
o Comparing (residual) genetic variances between models.
o From expressions in quantitative genetics and standard mixed model applications.
Issue 3: Epistasis
Epistasis might be important for some traits. There are several options for detecting epistasis:
Linear models. Include an interaction term in any of the linear models discussed earlier and test it for significance. The problem is that the large number of potential interaction terms leads to supersaturated models (i.e. there are more parameters (p) to estimate than there are independent samples (n)). One possibility is to use a forward selection approach, adding interaction terms one at a time and comparing the models.
Bayesian approaches. You may fit a model including all possible terms, assuming that parameters are either close to zero or have a large value.
Issue 4: Polyploids
TYPE OF POLYPLOIDS
o Allopolyploids (i.e. chromosome sets derived from different species, as in wheat) with meiotic pairing of ancestral genotypes can be mapped as diploid species.
o Autopolyploids (i.e. chromosome sets derived from the same ancestral species) may have bivalent (i.e. two-chromosome) or multivalent pairing during meiosis. Identifying alleles, number of allele copies, and linkage phase is a challenge.
o Bivalent: only two chromosomes pair during meiosis, segregating one chromosome to each set of gametes.
o Multivalent: multiple chromosomes pair during meiosis, resulting in gametes with different combinations of the chromosomes.
Issue 4: Polyploids
MAPPING ALTERNATIVES
o Diploid relatives. Use diploid species that are related to the polyploids of interest. However, polyploidization is highly dynamic, not all polyploids of interest have diploid relatives, and breeding of polyploid species is usually managed at the polyploid level.
o Single dose restriction fragments (Ritter et al., 1990). Use simplex parents (Mmmm) to produce gametes that segregate in a 1:1 ratio (Wu et al., 1992). Could be used with dominant or co-dominant markers.
o Multiple dose markers (Ripol et al., 1999).
o Doerge and Craig (2000) assumed complete preferential pairing (OK for allopolyploids).
o Hackett et al. (2001) assumed random chromosome pairing (OK for extreme autopolyploids).
o Ma et al. (2002) and Wu et al. (2004) incorporated a preferential pairing factor that depends on chromosome similarity.
o Cao et al. (2005) extended it to any even ploidy level.
Issue 5: GxE Interaction
[Figure: Genotype 1 (G1) and Genotype 2 (G2) compared in two environments (ENV 1 / E1 and ENV 2 / E2).]
Why QTLxE?
o To use appropriate residuals to test effects
o To identify main-effect QTL and environment-specific QTL
Module 1: Lecture 7b 1
Multiple Testing in QTL Mapping
Lucia Gutierrez lecture notes
Tucson Winter Plant Breeding Institute
Multiple Testing
[Figure: cartoon on multiple testing.]
WHAT IS WRONG WITH THE PREVIOUS CARTOON?
A common practice is to use P<0.05 to decide about the significance of a test. But with a large number of tests (as is common in QTL mapping, where we perform one test at each marker), the chance of having at least one false positive approaches 1 pretty quickly.
Let's look at some theory to clarify this concept.
Hypothesis Testing (in QTL context)
OUTCOMES OF A STATISTICAL TEST
FALSE POSITIVE: occurs when a QTL is incorrectly declared present.
TRUE POSITIVE: occurs when a QTL is correctly declared present.
FALSE NEGATIVE: occurs when a QTL is incorrectly declared absent.
TRUE NEGATIVE: occurs when a QTL is correctly declared absent.

                     "THE TRUTH"
RESULT FROM TEST     H0 is True (i.e. no QTL)   H0 is False (i.e. a QTL)
Reject H0            False Positive             True Positive
Do not reject H0     True Negative              False Negative
(B Chapter 5)
Type I Error
Type I error: the incorrect rejection of a true null hypothesis (an incorrectly declared QTL) in a given test.
Probability of Type I error (α): the probability of finding a false positive (an incorrectly declared QTL) in a given test.
Type II Error
Type II error: the failure to reject a false null hypothesis (there is a QTL and we fail to declare it) in a given test.
Probability of Type II error (β): the probability of finding a false negative (failure to declare a true QTL) in a given test.
Type I and II Error
NOTE that for a given test, decreasing the false positive rate means that the power (the proportion of true positives, or 1-β) is also decreased. The only way to decrease false positives and increase power at the same time is to change your design (i.e. increasing the population size, reducing experimental error, etc.).
Multiple Testing
Let's now assume that we are not interested in a single hypothesis test but are conducting multiple hypothesis tests. For example, we perform one hypothesis test at each marker (at each marker we ask whether the marker is associated with a QTL or not). We can summarize the number of hypotheses that fall into each category in the following chart (Benjamini and Hochberg, 1995):

                Decision
"Truth"         Do not reject H0   Reject H0   Total
H0 is true      U                  V           m0
H0 is false     Z                  S           m1 = m - m0
Total           m - R              R           m
Multiple Testing
Some more definitions:
Per-family error rate (PFER): expected number of false positives: E(V).
Per-comparison error rate (PCER): proportion of false positives in the total: E(V)/m.
Family-wise error rate (FWER): probability of at least one false positive: P(V≥1).
False Discovery Rate (FDR): proportion of false positives among rejected hypotheses: E(V/R).

We could be interested in controlling the probability of having at least one false positive test (using family control). Or we could be interested in controlling the proportion of false discoveries among the rejected tests (using FDR).
Benjamini and Hochberg, 1995
Multiple Testing – Family control
If we aim to control FWER, it is necessary to use a somewhat smaller alpha value. But how small should it be? There are many methods that aim to control FWER. We will discuss two common methods that have been used in QTL mapping.
Bonferroni:
Since the increase in the error is related to the number of independent tests, Bonferroni proposed to use a new alpha value as threshold:
$$\alpha^* = 1 - (1-\alpha)^{1/m} \cong \frac{\alpha}{m}$$
However, many studies have shown that a Bonferroni correction for QTL studies is overly conservative, mainly because tests are not independent (i.e. markers are linked and therefore not independent). An unnecessarily stringent threshold reduces the power to detect true QTL.
Benjamini and Hochberg, 1995; Li and Ji, 2005
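A short calculation shows why an uncorrected P<0.05 fails with many markers and what the corrected threshold looks like. The sketch assumes m independent tests, each at level alpha, so the family-wise error rate is 1 - (1 - alpha)^m; the function names are hypothetical. The exact per-test threshold 1 - (1 - alpha)^(1/m) is often called the Šidák form, and alpha/m is its Bonferroni approximation.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def exact_threshold(alpha, m):
    """Per-test level that keeps the family-wise error rate at alpha (independent tests)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

def bonferroni_threshold(alpha, m):
    """Approximate per-test level alpha / m."""
    return alpha / m

for m in (1, 10, 100, 1000):
    print(m,
          round(familywise_error_rate(0.05, m), 4),
          f"{exact_threshold(0.05, m):.6f}",
          f"{bonferroni_threshold(0.05, m):.6f}")
```

With 100 independent tests the chance of at least one false positive is already above 99%, and the two thresholds agree to within a few percent.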
Multiple Testing – Family control
Li and Ji (2005): An alternative is to use a Bonferroni correction with the effective number of independent tests instead of the total number of tests (which we know are not independent because markers are correlated). This idea was first proposed by Cheverud (2001) and then modified by Li and Ji (2005). The steps are:
1. Calculate the correlation matrix of the markers.
2. Use the eigenvalues (λi) of the correlation matrix that are significantly different from zero to determine the effective number of independent tests (Meff) as:

\[ M_{eff} = \sum_{i=1}^{M} f(|\lambda_i|), \qquad f(x) = I(x \ge 1) + (x - \lfloor x \rfloor), \quad x \ge 0 \]

3. Use a Bonferroni-type correction to determine the threshold, but with the effective number of independent tests instead of the total number of tests:

\[ \alpha^{*} = 1 - (1 - \alpha)^{1/M_{eff}} \cong \frac{\alpha}{M_{eff}} \]

Li and Ji, 2005
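The Meff steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `meff_li_ji` and `adjusted_alpha` are hypothetical, the input is assumed to be a marker correlation matrix that has already been computed, and the rounding step is an implementation choice added here to guard against floating-point noise.

```python
import numpy as np

def meff_li_ji(corr):
    """Effective number of independent tests from a marker correlation matrix."""
    eigenvalues = np.abs(np.linalg.eigvalsh(np.asarray(corr, dtype=float)))
    # Round away floating-point noise so that, e.g., 2.9999999999 is treated
    # as 3 before the floor is taken (a choice made for this sketch).
    eigenvalues = np.round(eigenvalues, 10)
    indicator = (eigenvalues >= 1.0).astype(float)     # I(x >= 1)
    fractional = eigenvalues - np.floor(eigenvalues)   # x - floor(x)
    return float(np.sum(indicator + fractional))

def adjusted_alpha(alpha, n_tests):
    """alpha* = 1 - (1 - alpha)^(1 / n_tests), approximately alpha / n_tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)
```

With uncorrelated markers (identity correlation matrix) Meff equals the number of markers, while perfectly correlated markers collapse to a single effective test; `adjusted_alpha(0.05, meff)` then gives the corrected threshold.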
Multiple Testing – Family control
Li and Ji (2005): This method has been shown to be better than the Bonferroni correction and the Cheverud (2001) method. It performs as well as permutation but is fast and simple to carry out.
Li and Ji, 2005
Multiple Testing – False Discovery
The False Discovery Rate (FDR) is the proportion of false positives amongst the rejected hypotheses: E(V/R).
We are looking at the problem from a different perspective: out of the total tests that we reject, how many are true positives?
Benjamini and Hochberg, 1995
                               Decision
"Truth"                        Do not reject H0       Reject H0             Total
H0 is true (i.e. no QTL)       U (true negative)      V (false positive)    m0
H0 is false (i.e. a QTL)       Z (false negative)     S (true positive)     m1 = m - m0
Total                          m - R                  R                     m
Multiple Testing – False Discovery
False Discovery Rate (FDR): The steps to use the False Discovery Rate are as follows:
1. Order the observed p-values: p(1) ≤ ... ≤ p(m).
2. Calculate an arithmetic sequence as follows:

\[ \frac{i}{m}\,\alpha, \qquad i = 1, \ldots, m \]

3. Reject all hypotheses H(1), ..., H(k), where:

\[ k = \max\left\{ i : p_{(i)} \le \frac{i}{m}\,\alpha, \; 1 \le i \le m \right\} \]

FDR is also conservative in most cases. The effective number of markers could be used, applying the FDR with Meff instead of m (Li and Ji, 2005).

Benjamini and Hochberg, 1995
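The FDR steps above translate directly into a short routine. The following is a minimal sketch of the step-up rule as described on the slide; the function name `benjamini_hochberg` and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Step 1: order the p-values, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Steps 2-3: find the largest rank i with p(i) <= (i / m) * alpha.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank
    # Reject the k smallest p-values (possibly none).
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from ten marker tests.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, alpha=0.05)
```

Note that the rule is "step-up": every hypothesis ranked at or below k is rejected, even if some intermediate p-value exceeds its own threshold.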
Multiple Testing - Permutation
Permutations (Broman and Sen, 2009).
The steps involved are:
1. Randomize (shuffle) the phenotypes relative to the marker data.
2. Perform QTL mapping, obtain p-values (or LOD scores), and keep the most extreme value (i.e., the maximum LOD or smallest p-value): Mi*.
3. Repeat steps 1 and 2 many (r) times (e.g., 1,000 or 10,000).
4. Produce the empirical distribution of the extreme values: M1*, ..., Mr*.
5. Use the 95th percentile of the Mi* values as the threshold (or the 5th percentile when the smallest p-value is kept).
Permutations are computationally intensive, but they are precise in all cases.
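The permutation steps can be sketched as follows. This is an illustration under simplifying assumptions, not a QTL mapping package: the per-marker statistic is the absolute marker-phenotype correlation, standing in for a LOD score, and the function names are hypothetical. All markers are assumed to be polymorphic.

```python
import numpy as np

def max_abs_correlation(genotypes, phenotype):
    """Most extreme statistic across markers (|r| is used in place of LOD)."""
    g = genotypes - genotypes.mean(axis=0)
    y = phenotype - phenotype.mean()
    denom = np.sqrt((g ** 2).sum(axis=0) * (y ** 2).sum())
    return float(np.max(np.abs(g.T @ y) / denom))

def permutation_threshold(genotypes, phenotype, n_perm=1000, level=0.95, seed=0):
    """Empirical threshold from the distribution of permuted maxima."""
    rng = np.random.default_rng(seed)
    maxima = np.empty(n_perm)
    for i in range(n_perm):
        # Step 1: shuffle phenotypes relative to markers; step 2: keep the maximum.
        maxima[i] = max_abs_correlation(genotypes, rng.permutation(phenotype))
    # Step 5: the chosen percentile of the maxima is the genome-wide threshold.
    return float(np.quantile(maxima, level))
```

Because the marker matrix is left intact while phenotypes are shuffled, the correlation among markers is preserved, which is why the resulting threshold adapts to linked markers.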
Multiple Testing
SO WHAT WAS WRONG WITH THE PREVIOUS CARTOON?
A common practice is to use P < 0.05 to decide on the significance of a test. But with a large number of tests, the chance of having at least one false positive is close to 1.

LET'S LOOK AT OUR OPTIONS IN HYPOTHESIS TESTING
1. Family control: Bonferroni multiple-testing protection
   1. P = 0.05 / number of tests (this is very conservative)
   2. P = 0.05 / effective number of tests, with the effective number of tests estimated from the marker data (Li and Ji, 2005).
2. False Discovery Rate (Benjamini and Hochberg, 1995).
3. Permutations (Broman and Sen, 2009).
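The claim that the chance of at least one false positive approaches 1 can be checked with a one-line calculation. The sketch below assumes independent tests, which linked markers violate, so it is an upper-end illustration rather than an exact rate; the function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) for n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

# Family-wise error rate at alpha = 0.05 for increasing numbers of tests.
fwer_by_tests = {m: familywise_error_rate(0.05, m) for m in (1, 10, 100, 1000)}
```

At alpha = 0.05 the rate is already about 0.40 with 10 tests and exceeds 0.99 with 100 tests.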
Lecture 8: Association Mapping
Mike Gore lecture notes
Tucson Plant Breeding Institute
Module 1

• Genetic mapping
• Linkage disequilibrium
• Population structure
• Complex trait dissection
Linkage Analysis: Family
[Figure: P1 × P2 cross giving an F1 and F2; 1 generation of recombination; QTL interval]
A 20 cM interval could contain 200 or more genes.
Only hundreds of markers are needed to capture the recent recombination events, but at the expense of lower resolution.

Association mapping: Natural populations
[Figure: many generations of recombination]
Higher-resolution mapping of causative loci relative to linkage analysis, but potentially thousands to millions of genetic markers need to be scored on the population.
Linkage analysis vs. Association mapping
[Figure: mapping approaches (F2 / BC, near-isogenic lines, recombinant inbred lines, intermated recombinant inbreds, pedigree, positional cloning, association mapping) compared by resolution (bp), research time (year), and allele number]
Yu and Buckler 2006. Curr. Opin. Biotechnol. 17:155-160
Association Tests
• Evaluate whether SNPs associate with phenotype
• Natural populations
• Exploit extensive recombination
[Figure: six individuals with heights of 1.3 m, 1.5 m, 1.4 m, 1.8 m, 2.0 m, and 2.0 m, shown with their SNP haplotypes]
Manhattan plot: summarize association mapping results
• Identify genomic regions associated with a phenotype
• Fit a statistical model at each SNP in the genome
• Fitted models are used to test H0: No association between SNP and phenotype
Lipka
Linkage vs. Linkage disequilibrium
• Linkage = excess of parental gametes from a particular parent
• Linkage disequilibrium = nonrandom distribution of linkage phases in the population
Walsh
No LD: random distribution of linkage phases
[Figure: linkage within each parent. AB/ab parents give an excess of parental gametes AB, ab; Ab/aB parents give an excess of parental gametes Ab, aB. Two AB/ab parents and two Ab/aB parents are shown.]
Pool all gametes to estimate LD: AB, ab, Ab, aB equally frequent
Walsh

With LD, nonrandom distribution of linkage phase
[Figure: linkage within each parent. Three AB/ab parents (excess of parental gametes AB, ab) and one Ab/aB parent (excess of parental gametes Ab, aB) are shown.]
Pool all gametes to estimate LD: Excess of AB, ab due to an excess of AB/ab parents
Walsh
LD: Linkage disequilibrium
D(AB) = freq(AB) - freq(A)*freq(B)
If A and B are independent, then D = 0 (no LD). If D ≠ 0, there is a correlation between A and B in the population. The range of D is determined by the allele frequencies (undesirable; it needs to be normalized).
If a marker and QTL are linked, then the marker and QTL alleles are in LD in close relatives, generating a marker-trait association.
The decay of D: D(t) = (1-c)^t D(0), where c is the recombination rate and t is the number of random mating generations. Tightly linked genes (small c) initially in LD can retain LD for long periods of time.
Walsh
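The decay formula D(t) = (1-c)^t D(0) is easy to explore numerically. The sketch below uses hypothetical function names and simply evaluates the formula and its inverse (the number of generations needed for D to fall to a given fraction of its starting value).

```python
import math

def ld_after_generations(d0, c, t):
    """D(t) = (1 - c)^t * D(0) under random mating."""
    return (1.0 - c) ** t * d0

def generations_to_fraction(c, fraction):
    """Generations of random mating until D(t) / D(0) equals `fraction`."""
    return math.log(fraction) / math.log(1.0 - c)

# Unlinked loci (c = 0.5) halve D every generation; tightly linked loci
# (c = 0.01) need roughly 69 generations for the same reduction.
half_life_unlinked = generations_to_fraction(0.5, 0.5)
half_life_linked = generations_to_fraction(0.01, 0.5)
```

This makes the last sentence of the slide concrete: the smaller c is, the longer initial LD persists.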
Measures of LD
• The maximum value of D is a function of allele frequencies
• For two biallelic loci, let p = Freq(A), q = Freq(B)
  – Dmax = min[p(1-q), (1-p)q] for D ≥ 0
  – Dmin = max[-pq, -(1-p)(1-q)] for D < 0
• Lewontin's D′ (normalized D; 1964) is defined as
  – D′ = D/Dmax for D ≥ 0
  – D′ = D/Dmin for D < 0
• Can also scale D by expressing it as the correlation r among alleles
  – r = D/sqrt[p*(1-p)*q*(1-q)]
  – Under drift-mutation-recombination equilibrium, E(r2) ≈ 1/(1 + 4Nec)
Walsh/Gore
Measures of LD: D and D′
• D describes the difference between coupling and repulsion gamete frequencies
  – Hedrick 1987. Genetics 117:331–341
• D captures information about allelic association and allele frequencies
• D′ is preferred because it is normalized and thus ranges between 0 and 1
• D and D′ may be highly erratic with rare alleles and small sample sizes
Measures of LD: r2
• r2 (0 to 1) is the squared value of Pearson's correlation coefficient
  – Hill and Robertson 1968. Theor. Appl. Genet. 38:226–231
• r2 summarizes both recombinational and mutational histories, while D and D′ measure only recombinational history
• r2 is preferred in association studies because it is more indicative of how markers might correlate with QTL
Examples of LD estimation

Gamete    Freq
AB        0.3
Ab        0.2
aB        0.4
ab        0.1

freq(A) = p = freq(AB) + freq(Ab) = 0.5
freq(B) = q = freq(AB) + freq(aB) = 0.7
freq(a) = 1 - p = 0.5
freq(b) = 1 - q = 0.3
Linkage-equilibrium value for AB = freq(A)*freq(B) = 0.35
D(AB) = freq(AB) - freq(A)*freq(B) = -0.05
Dmin = max[-pq, -(1-p)(1-q)] = max(-0.35, -0.15) = -0.15
D′ = D/Dmin = -0.05/-0.15 = 0.33
r = D/sqrt[p*(1-p)*q*(1-q)] = -0.05/sqrt(0.35*0.15) = -0.218
r2 = D^2/[p*(1-p)*q*(1-q)] = 0.0025/(0.35*0.15) = 0.0476
Walsh/Gore
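The worked example above can be reproduced with a small helper. This is a sketch with a hypothetical function name; it takes the four gamete frequencies as input and assumes both loci are polymorphic (otherwise the denominators are zero).

```python
import math

def ld_measures(f_AB, f_Ab, f_aB, f_ab):
    """D, D', r, and r2 from the four gamete frequencies (which sum to 1)."""
    p = f_AB + f_Ab          # freq(A)
    q = f_AB + f_aB          # freq(B)
    d = f_AB - p * q
    if d >= 0:
        bound = min(p * (1 - q), (1 - p) * q)      # Dmax
    else:
        bound = max(-p * q, -(1 - p) * (1 - q))    # Dmin
    d_prime = d / bound if bound != 0 else 0.0
    r = d / math.sqrt(p * (1 - p) * q * (1 - q))
    return {"D": d, "D_prime": d_prime, "r": r, "r2": r * r}

# Gamete frequencies from the example: AB 0.3, Ab 0.2, aB 0.4, ab 0.1.
example_ld = ld_measures(0.3, 0.2, 0.4, 0.1)
```

The same helper can be applied to the 2 × 2 haplotype-count tables on the following slides after dividing each count by the table total.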
Linkage disequilibrium

Complete Disequilibrium (D′ = 1, r2 = 1)
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
    6   0
    0   6
* Complete LD between sites
* Same mutational history
* Low mapping resolution

Complete Equilibrium (D′ = 0, r2 = 0)
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
    3   3
    3   3
* Pattern implies recombination regardless of mutational history
* High mapping resolution

Modified from Rafalski 2002. Curr. Opin. Plant Biol. 5:94–100
Modified from Gaut and Long 2003. Plant Cell 15:1502-1506
Linkage disequilibrium

Partial Disequilibrium (D′ = 1, r2 = 0.33)
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
    6   3
    0   3
* Site 2 may be a relatively new mutation without recombination
* Moderate mapping resolution

Complete Equilibrium (D′ = 0, r2 = 0)
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
    4   4
    2   2
* Pattern implies recombination

Modified from Rafalski 2002. Curr. Opin. Plant Biol. 5:94–100
Modified from Gaut and Long 2003. Plant Cell 15:1502-1506
r2 in Association Mapping

SNP Marker (Locus 1) and Causative Variant (Locus 2): D′ = 1, r2 = 0.25
Haplotype counts (columns: Locus 1 alleles 1 and 2; rows: Locus 2 alleles):
    5   3
    0   2

Causative variant explains 10% of total phenotypic variance.
SNP Marker will explain 25% of the total QTL variation, but only 2.5% of the total phenotypic variation. Need large sample size.
r2 > 0.80 is recommended for association studies.
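The arithmetic behind the 2.5% figure, and why a large sample is needed, can be written out explicitly. The sketch below uses hypothetical function names; the 1/r2 sample-size inflation is a commonly used approximation for indirect association and is stated here as an assumption rather than taken from the slide.

```python
def marker_variance_explained(r2, causal_variance_fraction):
    """Fraction of phenotypic variance captured by a marker in LD with the causal variant."""
    return r2 * causal_variance_fraction

def sample_size_inflation(r2):
    """Approximate factor by which sample size must grow to keep power (1 / r2)."""
    return 1.0 / r2

# Slide example: r2 = 0.25, causal variant explains 10% of phenotypic variance.
explained = marker_variance_explained(0.25, 0.10)   # 0.025, i.e. 2.5%
inflation = sample_size_inflation(0.25)             # 4x the sample size
```

At the recommended r2 of 0.80 the same calculation gives a sample-size inflation of only 1.25.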
Visualize the extent of linkage disequilibrium between pairs of loci: r2 vs. physical distance
[Figure: r2 vs. physical distance for the maize Dwarf3 (d3) gene]
Remington et al. 2001. PNAS 98:11479-11484
On average, intragenic LD rates rapidly decline to nominal levels (r2 < 0.1) within 2 kb in diverse maize.
The line fitted to the data minimizes the sum of the squared differences between r2 and its expected value, assuming recombination scales with physical distance.
Plot of GWAS results for α-tocopherol (αT) content in maize grain and LD (r2) across the ZmVTE4 region
[Figure] Purple triangles are the r2 values of each SNP relative to the peak SNP (indicated in red). The blue vertical lines are –log10 P-values for SNPs that are statistically significant for αT at 5% FDR, whereas the gray vertical lines are –log10 P-values for SNPs that are non-significant at 5% FDR.
Lipka et al. 2013. G3 3:1287-1299
There are many evolutionary forces that can shape levels of LD in a population
• Natural and artificial selection
• Recombination rate
• Genetic drift
• Mutation rate
• Population structure
• Population expansion/bottleneck
• Admixture
• Mating system
Slatkin 2008. Nature Reviews Genetics 9:477-485
LD range in major crops
[Figure legend: non-coding sites; synonymous sites]
Buckler and Gore 2007. Nat. Genet. 39:1056-1057
Linkage vs. Association
• The important distinction between linkage and association is subtle, yet very critical
• Marker allele M is associated with the trait if Cov(M,y) ≠ 0
• While such associations can arise via linkage, they can also arise via population structure
• Thus, association DOES NOT imply linkage, and linkage is not sufficient for association
Walsh
Population structure: case/control samples in association mapping

When a population being sampled consists of several distinct subpopulations, marker alleles may provide information as to which group an individual belongs. If there are other risk factors in a group, this can create a false association between a marker and trait.

Example. The Gm (human immunoglobulin G; IgG) marker was thought (IgG has an insulin-like effect) to be an excellent candidate gene for non-insulin-dependent diabetes in the high-risk population of Pima Indians in the American Southwest.

Initially a highly significant negative association was observed between the Gm+ haplotype and diabetes:

Gm+        Total    % with diabetes
Present    293      8%
Absent     4,627    29%

Problem: The presence/absence of this haplotype is also a very sensitive indicator of admixture with the Caucasian population. The frequency of Gm+ is around 67% in Caucasians (lower-risk diabetes population) as compared to <1% in full-heritage Pima.

The association was re-examined in a population of Pima adults (over age 35) that were 7/8th (or more) full heritage. Now the association between the Gm+ haplotype and disease disappeared:

Gm+        Total    % with diabetes
Present    17       59%
Absent     1,764    60%

The Gm+ marker is a predictor of diabetes not because it is linked to causative gene(s) but because it predicts whether an individual is from a specific population. Admixed individuals with a significant fraction of genes from Caucasian extraction have a lower chance of carrying the causative gene(s), while gene(s) increasing risk of diabetes are at a higher frequency in full-heritage Pima.

Walsh/Gore
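The mechanism in the Gm example can be mimicked with a small calculation. The numbers below are hypothetical and chosen only for illustration (they are not the Pima data): within each subpopulation the marker has no effect on disease risk, yet pooling the two groups produces a strong apparent association.

```python
def pooled_rates(groups):
    """Disease rate among marker carriers and non-carriers after pooling groups.

    Each group is a dict with n (size), marker_freq, and disease_rate, and the
    marker is assumed independent of disease within every group.
    """
    carriers = sum(g["n"] * g["marker_freq"] for g in groups)
    noncarriers = sum(g["n"] * (1 - g["marker_freq"]) for g in groups)
    sick_carriers = sum(g["n"] * g["marker_freq"] * g["disease_rate"] for g in groups)
    sick_noncarriers = sum(
        g["n"] * (1 - g["marker_freq"]) * g["disease_rate"] for g in groups
    )
    return sick_carriers / carriers, sick_noncarriers / noncarriers

# Hypothetical subpopulations: marker common where risk is low, rare where risk is high.
groups = [
    {"n": 1000, "marker_freq": 0.50, "disease_rate": 0.10},
    {"n": 4000, "marker_freq": 0.01, "disease_rate": 0.40},
]
rate_with_marker, rate_without_marker = pooled_rates(groups)
```

In the pooled sample, carriers show a disease rate of about 12% against about 37% in non-carriers, even though the marker is uninformative within either group. Analysing one group at a time removes the association, as in the re-examined Pima sample.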
FST, a measure of population structure
• One measure of population structure is given by Wright's FST statistic (also called the fixation index)
• Essentially, this is the fraction of genetic variation due to between-population differences in allele frequencies
• Changes in allele frequencies can be caused by evolutionary forces such as genetic drift, selection, and local adaptation
• Consider a biallelic locus (A, a). If p denotes the overall population frequency of allele A,
  – then the overall population variance is p(1-p)
  – Var(pi) = variance in p over subpopulations
  – FST = Var(pi)/[p(1-p)]
Walsh/Gore
Example of FST estimation

Population    Freq(A)
1             0.1
2             0.6
3             0.2
4             0.7

Assume all subpopulations contribute equally to the overall metapopulation.
Overall freq(A) = p = (0.1 + 0.6 + 0.2 + 0.7)/4 = 0.4
Var(pi) = E(pi^2) - [E(pi)]^2 = E(pi^2) - p^2
Var(pi) = [(0.1^2 + 0.6^2 + 0.2^2 + 0.7^2)/4] - 0.4^2 = 0.065
Total population variance = p(1-p) = 0.4(1-0.4) = 0.24
Hence, FST = Var(pi)/[p(1-p)] = 0.065/0.24 = 0.27
Walsh/Gore
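The FST calculation above can be checked with a short function. This is a minimal sketch with a hypothetical name; like the worked example, it assumes equally weighted subpopulations and a locus that is polymorphic in the metapopulation.

```python
def fst(subpop_freqs):
    """FST = Var(p_i) / [p(1 - p)] for equally weighted subpopulations."""
    n = len(subpop_freqs)
    p_bar = sum(subpop_freqs) / n
    var_p = sum(p * p for p in subpop_freqs) / n - p_bar ** 2
    return var_p / (p_bar * (1 - p_bar))

# Four subpopulations from the example: freq(A) = 0.1, 0.6, 0.2, 0.7.
example_fst = fst([0.1, 0.6, 0.2, 0.7])
```

The same function gives 0 for two subpopulations with identical frequencies and 1 for two subpopulations fixed for alternative alleles.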
Graphical example of FST
[Figure: two populations, P1 and P2; legend: homozygous, diploid]
• No population differentiation: P1 p = 0.5, q = 0.5; P2 p = 0.5, q = 0.5; FST = 0
• Strong population differentiation: P1 p = 0.9, q = 0.1; P2 p = 0.25, q = 0.75; FST = 0.43
• Complete population differentiation: P1 p = 1, q = 0; P2 p = 0, q = 1; FST = 1
Modified from Escalante et al. 2004. Trends Parasitol. 20:388-395
Rice population structure
[Figure] Unrooted neighbor-joining tree based on C.S. Chord distance (Cavalli-Sforza and Edwards 1967), computed from 169 nuclear SSRs. The key relates the color of the line to the chloroplast haplotype based on ORF100 and PS-ID sequences. * Admixed individuals. Figure annotations: FST = 0.25; FST = 0.43.
Garris et al. 2005. Genetics 169:1631-1638
Maize population structure
[Figure] Phylogenetic tree for 260 inbred lines using the log-transformed proportion of shared alleles distance. Groups: Non-Stiff Stalk, Tropical/Subtropical, Stiff-Stalk, Teosinte. Figure annotations: FST = 0.18; FST = 0.22.
Liu et al. 2003. Genetics 165:2117-2128
Flint-Garcia et al. 2005. Plant J. 44:1054-1064
[Figure: association mapping at the intersection of genetic diversity, genomic technology, and methodology development; annotations: r2, p = 1e-7, y = Xb + Sa + Qv + Zu + e]
Zhu et al. 2008. Plant Gen. 1:5-20
Construction of association mapping panels
http://shop.nativeseeds.org/products/don
http://www.oardc.ohio-state.edu/vanderknaap/caps_project.php
http://b4fa.org/biosciences-and-agriculture/plantbreeding/genetic-diversity/
http://en.wikipedia.org/wiki/List_of_culinary_vegetables
Phenotyping association mapping panels
White/yellow seed color is controlled by 1 gene in maize
Yellow-orange gradient in seed color is controlled by 11 genes in maize
Photo from T. Rocheford

Quantitative resistance to northern leaf blight is controlled by at least 26 loci in maize
• Poland et al. 2011. PNAS 108:6893-6898
Flowering time is controlled by at least ~40 loci in maize
• Buckler et al. 2009. Science 325:714-718
Traits such as biomass or grain yield are expected to be substantially more complex
Molecular Diversity I: Genome to Genes

Ploidy level, chromosome number, and genome size
Species        Chromosomes    Genome size
Arabidopsis    2n=2x=10       ~134 Mb
Rice           2n=2x=24       ~420 Mb
Sorghum        2n=2x=20       ~760 Mb
Maize          2n=2x=20       ~2,500 Mb
Barley         2n=2x=14       ~4,900 Mb
Cotton         2n=4x=52       ~2,500 Mb
Oat            2n=6x=42       ~11,300 Mb
Wheat          2n=6x=42       ~16,500 Mb

Number of genes
Species        Genes
Arabidopsis    27,379
Rice           35,679
Sorghum        34,496
Cotton         37,505
Maize          32,540

J. Poland
http://www.gencodys.eu/Patient%20entry%20.php
http://chibba.agtec.uga.edu/duplication/
Molecular Diversity II: Nucleotide variants in a population
Single-Nucleotide Polymorphism (SNP)

Line 1    …TGAACCTAAGTATGTCCG…
Line 2    …TGAACCTAAGTATGTCCG…
Line 3    …TGAACCTAAGTATGTCCG…
Line 4    …TGAACCTAGGTATGTCCG…
Line 5    …TGAACCTAGGTATGTCCG…
Line 6    …TGAACCTAGGTATGTCCG…

SNP allele: A/G
The number of markers needed for association mapping with complete genome-wide coverage (r2 ≥ 0.8) depends on the following population, genomic, and genetic architecture parameters:
• Genome size
• Rate of LD decay
• Nucleotide diversity levels
• Causative variant effect sizes

Approximate numbers of SNPs needed:
• Arabidopsis – ~200,000 SNPs
• Grape – ~2,000,000 SNPs
• Diverse maize – ~20,000,000 SNPs
The costs of DNA sequencing have rapidly decreased
https://www.genome.gov
Doubling of "computer power" every two years
https://www.genome.gov

Constructing a Reference Genome Sequence
Whole-Genome Resequencing

Three steps of HapMap construction
(a) SNPs are identified in DNA samples from multiple individuals.
(b) Adjacent SNPs that are inherited together are compiled into haplotypes.
(c) "Tag" SNPs are identified within haplotypes that uniquely describe those haplotypes.
The International HapMap Project
Genotyping-by-sequencing (GBS): massively parallel sequencing of multiplexed reduced-representation libraries
Poland and Rife 2012. Plant Gen. 5:92-102

Fast Inbred Line Library ImputatioN (FILLIN) algorithm: rapidly and accurately impute missing genotypes in low-coverage, GBS-type data with ordered markers
[Figure labels: haplotype library via clustering; recombination breakpoint; smaller windows]
FILLIN first tries to impute the entire site window with one (1a) or two (1b) haplotypes (using the Viterbi Hidden Markov Model [HMM] to model the recombination break point). If that is unsuccessful, it tries to impute smaller windows, first with one haplotype (2a), then with two using Viterbi (2b), and finally by combining two haplotypes to model heterozygosity (2c). If this does not satisfy (lower) error thresholds, the smaller window is not imputed. Dashed arrows mean that the algorithm continues if conditions are not satisfied for imputation. Beagle v. 4 is still preferable for diverse heterozygous populations.
Swarts et al. 2014. Plant Gen. 7:1-12
Genetic relatedness
Different types of samples used for association mapping
[Figure: sample types (maize, rice, human admixture, CEPH grandparents, CEPH Utah family) arranged along two axes. The population structure axis depicts relationships among major subpopulations associated with local adaptation or diversifying selection. The familial relatedness axis depicts the relationships among individuals from recent coancestry.]
Yu et al. 2006. Nat. Genet. 38:203-208
Modified slide from Ed Buckler
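Mixed linear models are used to reduce false positives in association mapping
[Equation figure: annotated mixed linear model with the following terms]
• Phenotype of ith individual
• Grand mean
• Fixed effects: account for population structure
• Marker effect
• Observed SNP alleles of ith individual
• Random effects: account for familial relatedness
• Random error term
• The line effects (Line1, …, Linen) follow a multivariate normal distribution with mean 0 and a covariance structured by K
• K = kinship matrix; measures relatedness between individuals
• The errors εi are i.i.d. normal with mean 0 and constant variance
Yu et al. 2006. Nat. Genet. 38:203-208
Lipka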
Q (population structure) + K (relatedness)
[Figure: model comparison]
Yu et al. 2006. Nat. Genet. 38:203-208
Association mapping in 373 indica rice lines with nearly one million SNPs: traits with a weak correlation with population structure
[Figure annotation: qsw5]
Huang et al. 2010. Nat. Genet. 42:961-967

Association mapping in 373 indica rice lines with nearly one million SNPs: traits with a strong correlation with population structure
Huang et al. 2010. Nat. Genet. 42:961-967
Linkage analysis vs. Association mapping

Linkage analysis                                  Association mapping
Low resolution                                    High resolution
Small reference population & allele numbers       Large reference population & allele numbers
Balanced allele frequency                         Rare alleles
Known population structure                        Cryptic population structure
The maize Nested Association Mapping (NAM) panel
[Figure: parents P1, P2, ..., P25 each crossed (×) to B73, giving populations Pop1, Pop2, ..., Pop25; 5,000 RIL linkage map]
Linkage resolution: The 5,000 RILs are genotyped with 14k GBS SNP markers for NAM joint linkage.
NAM resolution: Whole-genome resequencing of parents, with 30M SNPs imputed onto recombination blocks.
QTL detection power: NAM panel size, heritability, and number of QTL
Yu et al. 2008. Genetics 178:539–551

Joint-Linkage/Association mapping of leaf traits in the maize NAM panel
Tian et al. 2011. Nat. Genet. 43:159–162
Lecture 9: Inbreeding and crossbreeding
Mike Gore lecture notes
Tucson Plant Breeding Institute
Module 1

• Inbreeding theory
• Heterosis theory
• Maize heterotic groups
• Genetic models of heterosis
• Genetic basis of heterotic loci
• Crossing schemes to reduce loss of heterosis
Inbreeding
• Inbreeding = mating of related individuals
• Often results in a change in the mean of a trait
• Inbreeding is intentionally practiced to:
  – Create genetic uniformity of laboratory stocks
  – Produce stocks for crossing (animal and plant breeding)
• Inbreeding is unintentionally generated:
  – By keeping small populations (such as is found at zoos)
  – During selection within animal and plant breeding programs
Walsh
• The inbreeding coefficient, F
• F = Prob(the two alleles within an individual are IBD), where IBD means identical by descent
• Hence, with probability F both alleles in an individual are identical, and hence a homozygote
• With probability 1-F, the alleles are combined at random
Genotype frequencies under inbreeding

Genotype    Alleles IBD    Alleles not IBD    Frequency
A1A1        Fp             (1-F)p^2           p^2 + Fpq
A2A1        0              (1-F)2pq           (1-F)2pq
A2A2        Fq             (1-F)q^2           q^2 + Fpq

[Figure: tree diagram. With probability F the alleles are IBD, giving A1A1 (probability p) or A2A2 (probability q). With probability 1-F the alleles combine as under random mating, giving A1A1, A2A1, or A2A2.]
Walsh
Changes in the mean under inbreeding
Using the genotypic frequencies under inbreeding, the population mean mF under a level of inbreeding F is related to the mean m0 under random mating by
mF = m0 - 2Fpqd
Walsh
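The relation mF = m0 - 2Fpqd can be verified directly from the genotype frequencies under inbreeding. The sketch below assumes the usual single-locus coding of genotypic values (A1A1 = a, A1A2 = d, A2A2 = -a), which is not spelled out on the slide, and uses hypothetical function names.

```python
def genotype_frequencies(p, F):
    """Genotype frequencies at a biallelic locus with inbreeding coefficient F."""
    q = 1.0 - p
    return {
        "A1A1": p * p + F * p * q,
        "A1A2": (1.0 - F) * 2.0 * p * q,
        "A2A2": q * q + F * p * q,
    }

def population_mean(p, F, a, d):
    """Mean genotypic value, assuming values a, d, -a for A1A1, A1A2, A2A2."""
    freq = genotype_frequencies(p, F)
    return a * freq["A1A1"] + d * freq["A1A2"] - a * freq["A2A2"]

# With p = 0.5, a = 1, d = 0.5: random-mating mean 0.25, mean at F = 0.5 is 0.125.
m0 = population_mean(0.5, 0.0, 1.0, 0.5)
mF = population_mean(0.5, 0.5, 1.0, 0.5)
```

The difference m0 - mF equals 2Fpqd = 2(0.5)(0.25)(0.5) = 0.125, and it vanishes when d = 0.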
• There will be a change in mean value if dominance is present (d not 0)
• For a single locus, if d > 0, inbreeding will decrease the mean value of the trait. If d < 0, inbreeding will increase the mean.
• For multiple loci, a decrease (inbreeding depression) requires directional dominance: dominance effects di tending to be positive
• The magnitude of the change of mean on inbreeding depends on allele frequency, and is greatest when p = q = 0.5
Walsh
Inbreeding depression
Example for maize height
[Figure: plants labeled F1, F2, F3, F4, F5, F6, F7]
Walsh
Fitness traits and inbreeding depression
• Often seen that inbreeding depression is strongest on fitness-related traits such as yield, height, etc.
• Traits less associated with fitness often show less inbreeding depression
• Selection on fitness-related traits may generate directional dominance
Walsh
Inbreeding depression in selfing lineages?
• Inbreeding depression is common in outcrossing species
• However, generally fairly uncommon in species with a high rate of selfing
• One idea is that the constant selfing has purged many of the deleterious alleles thought to cause inbreeding depression
• However, lack of inbreeding depression also means a lack of heterosis
Walsh
Variance changes under inbreeding
Inbreeding reduces variation within each population.
Inbreeding increases the variation between populations (i.e., variation in the means of the populations).
[Figure: distributions at F = 0, F = 1/4, F = 3/4, and F = 1]
Between-group variance increases with F.
Within-group variance decreases with F.
Walsh
Implications for traits
• A series of inbred lines from an F2 population are expected to show:
  – more within-line uniformity (variance about the mean within a line)
    • Less within-family genetic variation for selection
  – more between-line divergence (variation in the mean value between lines)
    • More between-family genetic variation for selection
Walsh
Variance changes under inbreeding

                 General      F = 1    F = 0
Between lines    2F VA        2VA      0
Within lines     (1-F) VA     0        VA
Total            (1+F) VA     2VA      VA

The above results assume ONLY additive variance, i.e., no dominance/epistasis. When non-additive variance is present, the results become very complex (see WL Chp. 3).
Walsh
Heterosis (or hybrid vigor) is one of the least understood universal biological phenomena that has been exploited by animal and plant breeders to increase the productivity of domesticated species
Springer and Stupar 2007. Genome Res. 17:264-275
[Figure labels: indica, japonica]
http://www.cmbb.arizona.edu/?page_id=15
Heterosis is the increased vigor, growth, size, yield, or function of hybrid progeny over the parents that results from crossing genetically unlike organisms
(1) Mid-parent (MP) heterosis = (F1 – MP)/MP × 100 = % MPH
(2) High-parent (HP) heterosis = (F1 – HP)/HP × 100 = % HPH
(3) Absolute heterosis = F1 – MP
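The three heterosis measures above are simple to compute from trial means. The sketch below uses a hypothetical function name and made-up yields; it takes the high parent to be the parent with the larger value, which assumes larger trait values are better.

```python
def heterosis(f1, parent1, parent2):
    """Absolute, mid-parent (%), and high-parent (%) heterosis."""
    mid_parent = (parent1 + parent2) / 2.0
    high_parent = max(parent1, parent2)
    return {
        "absolute": f1 - mid_parent,
        "percent_mph": (f1 - mid_parent) / mid_parent * 100.0,
        "percent_hph": (f1 - high_parent) / high_parent * 100.0,
    }

# Hypothetical yields (t/ha): F1 = 12, parents = 8 and 10.
example_heterosis = heterosis(12.0, 8.0, 10.0)
```

For these values the mid-parent is 9, so absolute heterosis is 3, mid-parent heterosis is about 33.3%, and high-parent heterosis is 20%.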
Maize phenotypes that show heterosis (High-parent heterosis)
Springer and Stupar 2007. Genome Res. 17:264-275
Yuan Longping
Father of Hybrid Rice: Yuan Longping developed the first hybrid rice in China as a university professor in the 1970s.
The average yield of hybrid rice is 7.2 t/ha, while inbred varieties yield 5.9 t/ha. It is estimated that approximately 70 million more people annually in China can be fed by planting hybrid rice.
http://english.cri.cn/8706/2013/06/05/3262s768613.htm
https://www.worldfoodprize.org/en/laureates/20002009_laureates/2004_jones_and_yuan/
Line crosses: Heterosis
When inbred lines are crossed, the progeny show an increase in mean for characters that previously suffered a reduction from inbreeding.
This increase in the mean over the average value of the parents is called hybrid vigor or heterosis.
A cross is said to show heterosis if H > 0, so that the F1 mean is larger than the average of both parents.
Walsh
Expected levels of heterosis between two lines
δpi is the allele frequency difference between two crossed lines at a locus with two alleles.
• Heterosis depends on dominance: If d = 0, then there is neither inbreeding depression nor heterosis. As with inbreeding depression, directional dominance (d > 0) is required for heterosis.
• H is proportional to the square of the difference in allele frequencies between parental lines from different populations. H is greatest when alleles are fixed in one population and lost in the other (so that |δpi| = 1). H = 0 if δp = 0.
• H is specific to each particular cross. H must be determined empirically, since we know neither the relevant QTL nor their allele frequencies.
Walsh/Gore
Heterosis declines in the F2
In the F1, all offspring are heterozygotes. In the F2, random mating has occurred, reducing the frequency of heterozygotes.
As a result, there is a reduction of the amount of heterosis in the F2 relative to the F1.
If random mating occurs in the F2 and subsequent generations, the level of heterosis stays at the F2 level.
Walsh
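A standard form of these expectations, assumed here from the quantitative-genetics literature rather than taken from the slide text, is H(F1) = Σ (δpi)^2 di summed over loci, with H(F2) = H(F1)/2 because random mating halves the excess heterozygosity. The sketch below evaluates both; the function names and the example loci are hypothetical.

```python
def f1_heterosis(delta_p, d):
    """H(F1) = sum over loci of (delta_p_i)^2 * d_i (assumed standard form)."""
    return sum(dp ** 2 * di for dp, di in zip(delta_p, d))

def f2_heterosis(delta_p, d):
    """H(F2) = H(F1) / 2 after one generation of random mating."""
    return f1_heterosis(delta_p, d) / 2.0

# Three hypothetical loci: allele frequency differences and dominance effects.
delta_p = [1.0, 0.5, 0.0]
d = [2.0, 4.0, 3.0]
h_f1 = f1_heterosis(delta_p, d)   # 2.0 + 1.0 + 0.0 = 3.0
h_f2 = f2_heterosis(delta_p, d)   # 1.5
```

The third locus contributes nothing because the lines do not differ in allele frequency there, in line with H = 0 when δp = 0.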
Agricultural importance of heterosis
Crosses often show high-parent heterosis, wherein the F1 not only beats the average of the two parents (mid-parent heterosis), it exceeds the best parent.

Crop         % planted     % yield      Annual added    Annual added     Annual land
             as hybrids    advantage    yield: %        yield: tons      savings
Maize        65            15           10              55 x 10^6        13 x 10^6 ha
Sorghum      48            40           19              13 x 10^6        9 x 10^6 ha
Sunflower    60            50           30              7 x 10^6         6 x 10^6 ha
Rice         12            30           4               15 x 10^6        6 x 10^6 ha

Walsh
Hybrid corn in the US
• Shull (1908) suggested that the objective of corn breeders should be to find and maintain the best parental lines for crosses
• Initial problem: early inbred lines had low seed set
• Solution (Jones 1918): use a hybrid line as the seed parent, as it should show heterosis for seed set
• 1930's - 1960's: most corn produced by double crosses
• Since 1970's, hybrids are mostly from single crosses
Walsh
A cautionary tale
• In 1970-71, the plant disease Southern Corn Leaf Blight almost destroyed the whole US corn crop
• The loss was much larger (in terms of food energy) than that of the great potato blight of the 1840's
• Cause: Corn can self-fertilize, so to make hybrids breeders either have to manually detassel the pollen structures or use genetic tricks that cause male sterility
• Almost 85% of US corn in 1970 had Texas cytoplasm Tcms, a mtDNA-encoded male sterility gene
• Tcms turned out to be hypersensitive to "race T" of the fungus Helminthosporium maydis. Billion dollar losses!
Walsh
Combination of genetics and agronomic technologies has allowed a 1% or better gain per year in grain yields
Phillips 2010. Crop Sci. 50:S-99-S-108
http://thescientistgardener.blogspot.com/2010/12/maize-is-machine.html

The making of corn belt dent
[Figure labels: Northern Flint; Modern CBD; Southern Dent; Reid Yellow Dent OPV; Lancaster Surecrop OPV]
Photo: http://thescientistgardener.blogspot.com/2010/12/maize-is-machine.html
The formation of maize heterotic groups
• Corn belt dent heterotic patterns are not the result of historical or geographical influences
• Open-pollinated varieties and first-cycle inbreds did not show heterotic patterns when crossed; therefore, markers would not have been helpful to identify heterotic groups
Tracy and Chandler 2006. pp 219–233

The formation of maize heterotic groups
• Corn belt dent heterotic patterns were created by breeders through trial and error
• In the 1940s, breeders started arbitrarily splitting the germplasm pool into groups (odd vs. even numbered lines)
• Genetic drift created initial divergence in allele frequencies, which was enhanced by selection
• Modern heterotic groups are the product of divergence from a homogeneous landrace (OPV) population
Tracy and Chandler 2006. pp 219–233
van Heerwaarden et al. 2012. PNAS 109:12420-12425
Genetic hypotheses of heterosis
• Dominance hypothesis – masking of unfavorable recessive alleles in a heterozygote. Two or more loci are needed because the value of a heterozygote at a single locus (d ≤ a) does not exceed the value of the superior parent.
  – If true, it should be possible to obtain an inbred that performs as well as the best hybrid
• Overdominance hypothesis – the heterozygote is superior to either homozygote. Only a single locus (d > a) is needed to achieve heterosis. Also, linkage is not needed to achieve heterosis.
  – If true, it should NOT be possible to obtain an inbred that performs as well as the best hybrid
Bernardo 2002. Breeding for Quantitative Traits in Plants pp 243-246
Genetic hypotheses of heterosis
• Pseudo-overdominance hypothesis – repulsion phase linkage of loci that show partial or complete dominance
  – The effects of two loci are difficult to separate if both are tightly linked. If we did not know that two loci comprise a single linkage block, we would incorrectly conclude that heterosis is due to overdominance.
  – Pseudo-overdominance is similar to the two-locus dominance hypothesis, with the exception that repulsion phase linkage is required for pseudo-overdominance.
Bernardo 2002. Breeding for Quantitative Traits in Plants pp 243-246
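A toy calculation shows how two dominant loci in repulsion phase can look like one overdominant locus. The sketch below is illustrative only: it assumes additive action across loci, genotypic values of +a, d, and -a for two, one, and zero favorable alleles, and hypothetical function names.

```python
def locus_value(n_favorable, a=1.0, d=1.0):
    """Genotypic value at one locus given the count of favorable alleles."""
    return {2: a, 1: d, 0: -a}[n_favorable]

def block_value(hap1, hap2, a=1.0, d=1.0):
    """Value of a two-locus linkage block; haplotypes are (A, B) 0/1 indicators."""
    return sum(locus_value(x + y, a, d) for x, y in zip(hap1, hap2))

Ab = (1, 0)   # favorable allele at locus A only
aB = (0, 1)   # favorable allele at locus B only (repulsion phase relative to Ab)
AB = (1, 1)   # coupling-phase haplotype carrying both favorable alleles

parent1 = block_value(Ab, Ab)      # +1 - 1 = 0
parent2 = block_value(aB, aB)      # -1 + 1 = 0
hybrid = block_value(Ab, aB)       # complete dominance at both loci: 1 + 1 = 2
best_inbred = block_value(AB, AB)  # 1 + 1 = 2
```

Treated as a single locus, the block appears overdominant (the hybrid exceeds both parents), yet a recombinant AB/AB inbred matches the hybrid, which is the prediction of the dominance hypothesis rather than true overdominance.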
Genetic models for heterosis
• Repulsion Phase Linkage – Superior A and B alleles create a superior phenotype from complementation
• Allelic interactions – Heterozygosity at the B locus with two functional alleles
• Complementation – Slightly deleterious homozygous a, b, c alleles
Birchler et al. 2006. PNAS 103:12957-12958
Epistasis appears to have a major role in the genetic basis of heterosis in an elite rice hybrid
Yu et al. 1997. PNAS 94:9226-9231
The two indica rice lines, Zhenshan 97 and Minghui 63, are parents of Shanyou 63, one of the most productive hybrids in China
Additive × Additive (AA); Additive × Dominance (AD); Dominance × Additive (DA); Dominance × Dominance (DD)
Hill-Robertson effect
• H-R effect: linkage between sites under selection reduces the overall effectiveness of selection for finite natural populations
• Repulsion phase linkages among favorable alleles will reduce the effectiveness of selection
• Favorable alleles have a higher chance of being in repulsion phase in the presence of low recombination
• If these favorable alleles exhibit dominance, then low recombination regions should be under high selective pressure to maintain heterozygosity
Hill and Robertson 1966. Genet. Res. 8:269-294 McMullen et al. 2009. Science 325:737-740
[Figure legend: (1) within 10 cM on each side of the centromere position; (2) the rest of the chromosome regions]
Residual heterozygosity is higher in pericentromeric regions and inversely correlated with recombination rate
Gore et al. 2009. Science 326:1115-1117
Hypothesized that these data support the dominance theory of heterosis and that the increase in heterozygosity near centromeres is a consequence of heterosis. This heterosis is likely the product of pseudo-overdominance, most pronounced in pericentromeric regions, where the Hill-Robertson effect is strongest.
McMullen et al. 2009. Science 325:737-740
Fine mapping of a heterotic QTL for maize grain yield
An overdominant QTL on chr. 5 was dissected by NILs into two tightly linked, dominant-effect QTL in repulsion phase. This provides evidence for pseudo-overdominance.
Graham et al. 1997. Crop Sci. 37:1601-1610
Meta-QTL analysis of heterosis
Concluded that pseudo-overdominance is a major cause of heterosis in maize and found no significant epistasis. Heterotic QTL for grain yield mapped near low-recombination (low R), pericentromeric regions (i.e., likely repulsion phase)
Schön et al. 2010. Theor. Appl. Genet. 120:321-332
Krieger et al. 2010. Nat. Genet. 42:459-463
Overdominance: heterozygosity for tomato loss-of-function alleles of SINGLE FLOWER TRUSS (SFT), which is the genetic originator of the flowering hormone florigen, increases yield by up to 60%
Yield overdominance from SFT heterozygosity is robust, occurring in distinct genetic backgrounds and across multiple environments
Heterotic yield effects derive from a dosage-dependent suppression of growth termination mediated by SELF PRUNING (SP), an antagonist of SFT
Heterozygotes have more inflorescences
Crossing Schemes to Reduce the Loss of Heterosis: Synthetics
Walsh
Take n lines and construct an F1 population by making all pairwise crosses
Synthetics
• Major trade-offs
– As more lines are added, the F2 loss of heterosis declines
– However, as more lines are added, the mean of the F1 also declines, as less elite lines are used
– Bottom line: for some value of n, F1 - H/n reaches a maximum value and then starts to decline with n
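The trade-off in the synthetic F2 mean, F1 - H/n, can be illustrated with a minimal Python sketch. All numbers below are hypothetical: the F1 means are simply assumed to fall as less elite lines are added, and the heterosis H is held constant for simplicity, which need not hold in practice.

```python
def synthetic_f2_mean(f1_mean: float, heterosis: float, n_lines: int) -> float:
    """Predicted F2 mean of a synthetic built from n_lines parents: F1 - H/n."""
    if n_lines < 2:
        raise ValueError("a synthetic needs at least two parental lines")
    return f1_mean - heterosis / n_lines


# Hypothetical F1 means by number of parental lines (they decline as
# less elite lines are brought in). These values are made up.
f1_means = {2: 100.0, 3: 99.0, 4: 97.5, 6: 94.0, 8: 90.0}
HETEROSIS = 20.0  # hypothetical H, assumed constant across n

f2_by_n = {n: synthetic_f2_mean(f1, HETEROSIS, n) for n, f1 in f1_means.items()}
best_n = max(f2_by_n, key=f2_by_n.get)  # n giving the largest F1 - H/n

for n, f2 in f2_by_n.items():
    print(f"n = {n}: F2 mean = {f2:.2f}")
print("best n:", best_n)
```

With these made-up values the F2 mean rises from n = 2 to n = 4 and then falls, which is the interior maximum described in the bottom line above.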
Types of crosses
• The F1 from a cross of lines A x B (typically inbreds) is called a single cross
• A three-way cross (also called a modified single cross) refers to the offspring of an A individual crossed to the F1 offspring of B x C.
– Denoted A x (B x C)
• A double (or four-way) cross is (A x B) x (C x D), the offspring from crossing an A x B F1 with a C x D F1
Predicting cross performance
• While single crosses (offspring of A x B) are hard to predict, three- and four-way crosses can be predicted if we know the means for single crosses involving these parents
• The three-way cross mean is the average of the means of the two single crosses:
– mean(A x {B x C}) = [mean(A x B) + mean(A x C)]/2
• The mean of a double (or four-way) cross is the average of all four single crosses between the two F1 parents,
– mean({A x B} x {C x D}) = [mean(A x C) + mean(A x D) + mean(B x C) + mean(B x D)]/4
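The two prediction rules for three-way and double crosses can be sketched in Python as follows. The single-cross means and the helper names are hypothetical, chosen only to illustrate the averaging.

```python
from itertools import product


def single_cross_mean(means, a, b):
    """Look up a single-cross mean regardless of parent order."""
    return means[frozenset((a, b))]


def three_way_mean(means, a, b, c):
    """mean(A x {B x C}) = [mean(A x B) + mean(A x C)] / 2."""
    return (single_cross_mean(means, a, b) + single_cross_mean(means, a, c)) / 2


def double_cross_mean(means, a, b, c, d):
    """mean({A x B} x {C x D}) = average of A x C, A x D, B x C, B x D."""
    crosses = [single_cross_mean(means, p, q) for p, q in product((a, b), (c, d))]
    return sum(crosses) / len(crosses)


# Hypothetical single-cross means among four inbreds A, B, C, D.
means = {
    frozenset(("A", "B")): 100.0,
    frozenset(("A", "C")): 110.0,
    frozenset(("A", "D")): 90.0,
    frozenset(("B", "C")): 105.0,
    frozenset(("B", "D")): 95.0,
    frozenset(("C", "D")): 98.0,
}

print(three_way_mean(means, "A", "B", "C"))          # A x (B x C)
print(double_cross_mean(means, "A", "B", "C", "D"))  # (A x B) x (C x D)
```

Note that the double-cross prediction uses only the four crosses between the two F1 parents; the A x B and C x D means do not enter.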
Lecture 10: Mass, Family, and Line Selection
Bruce Walsh Notes, Introduction to Plant Quantitative Genetics
Tucson, 8-10 Jan 2018
Topics
• Breeder’s equation
– Outcrossing population (h^2)
– Clones (H^2)
– Truncation selection, selection intensity
– Permanent vs. transient response
• Family selection
– Different types of family selection
• Selfing
– Selection while selfing
• Line selection
– SSD, Pedigree, Bulk, and DH schemes
– Early vs. late testing
Selection
• Basic goal is to develop elite genotypes
– With an outcrossed population, this is done by increasing the frequency of favorable alleles
  • Selection on additive (A), rather than interaction (D), effects
  • In a large population, continual improvement (response) is expected over a number of generations
– With inbred populations, this is done by generating a series of inbreds from a cross and picking the elite lines.
  • Selection on genotypic values, hence additive + interaction effects
  • Further progress depends upon generating new variation through additional crosses
Outcrossed Populations
• Improvement in an outcrossed (or open pollinated) population is akin to what animal breeders do for improvement (called recurrent selection by plant breeders)
– A within-generation change (the increase in trait mean among the selected individuals) is translated into a between-generation change.
• Individuals are chosen either on the basis of
– their phenotypic value (mass selection)
– the performance of their offspring (progeny testing) or relatives (such as sib or family selection)
– Idea is to use such information to obtain estimates of the breeding values of individuals
Response to Selection
• Selection can change the distribution of phenotypes, and we typically measure this by changes in the mean
– This is a within-generation change, namely the selection differential S = µ* - µ
• Selection can also change the distribution of breeding values
– This is the response to selection, the change in the trait mean in the next generation (the between-generation change), R(t) = µ(t+1) - µ(t)
The Breeder’s Equation: Translating S into R
Recall the regression of offspring value on midparent value,
$y_o = \mu + h^2 \left( \frac{P_f + P_m}{2} - \mu \right) + e$
Averaging over the values of the selected midparents, $E[(P_f + P_m)/2] = \mu^*$.
Likewise, averaging over the regression gives
$E[y_o - \mu] = h^2 (\mu^* - \mu) = h^2 S$
Since $E[y_o - \mu]$ is the change in the offspring mean, it represents the response to selection, giving
$R = h^2 S$, the Breeder’s Equation (Jay Lush)
• Note that no matter how strong S is, if h^2 is small, the response is small
• S is a measure of selection, R the actual response. One can get lots of selection but no response
• If offspring are asexual clones of their parents, the breeder’s equation becomes
– R = H^2 S
• If males and females are subjected to differing amounts of selection,
– S = (Sf + Sm)/2
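As a small numerical illustration of R = h^2 S with sex-specific selection differentials, here is a Python sketch. The heritability and differentials are hypothetical values, and the function names are illustrative only.

```python
def selection_differential(s_female: float, s_male: float) -> float:
    """Average the sex-specific differentials: S = (Sf + Sm) / 2."""
    return (s_female + s_male) / 2


def response(heritability: float, s: float) -> float:
    """Breeder's equation, R = h^2 * S (pass H^2 instead when selecting clones)."""
    if not 0.0 <= heritability <= 1.0:
        raise ValueError("heritability must lie between 0 and 1")
    return heritability * s


H2_NARROW = 0.3  # hypothetical narrow-sense heritability

# Both sexes selected equally hard.
r_both = response(H2_NARROW, selection_differential(10.0, 10.0))
# Only females selected (Sm = 0): the response is halved.
r_female_only = response(H2_NARROW, selection_differential(10.0, 0.0))

print(r_both, r_female_only)
```

The second case shows how selection on only one sex cuts S, and therefore R, in half.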
Pollen control
• Recall that S = (Sf + Sm)/2
• An issue that arises in plant breeding is pollen control: is the pollen from plants that have also been selected?
• This is not the case for traits (e.g., yield) scored after pollination. In this case, Sm = 0, so the response is only half that with pollen control
• Tradeoff: with an additional generation, a number of schemes can give pollen control, and hence twice the response
– However, this takes twice as many generations, so the response per generation is the same
Selection on clones
• Although we have framed response in an outcrossed population, we can also consider selecting the best individual clones from a large population of different clones (e.g., inbred lines)
• R = H^2 S, now a function of the broad-sense heritability. Since H^2 ≥ h^2, the single-generation response using clones is at least as large as that using outcrossed individuals
• However, the genetic variation in the next generation is significantly reduced, reducing response in subsequent generations
– In contrast, expect an almost continual response for several generations in an outcrossed population.
The Selection Intensity, i
The selection intensity i is the selection differential expressed in terms of phenotypic standard deviations, $i = S / \sigma_P$
Consider two traits, one with S = 10, the other with S = 5. Which trait is under stronger selection? Can’t tell, because S is a function of the phenotypic variance of the trait
In contrast, i is a scaled measure and hence allows for fair comparisons over different traits
Truncation selection
• A common method of artificial selection is truncation selection: all individuals whose trait value is above some threshold (T) are chosen.
• Equivalent to only choosing the uppermost fraction p of the population
Truncation selection
• The fraction p saved can be translated into an expected selection intensity (assuming the trait is normally distributed),
– this allows a breeder (by setting p in advance) to choose an expected value of i before selection, and hence set the expected response
– R code for i: dnorm(qnorm(1-p))/p, where dnorm(qnorm(1-p)) is the height of a unit normal at the threshold value corresponding to p

p    0.5     0.2     0.1     0.05    0.01    0.005
i    0.798   1.400   1.755   2.063   2.665   2.892
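For readers working outside R, the same calculation can be sketched in Python using only the standard library. `selection_intensity` is an illustrative name; `NormalDist().inv_cdf` and `NormalDist().pdf` play the roles of `qnorm` and `dnorm`.

```python
from statistics import NormalDist


def selection_intensity(p: float) -> float:
    """Expected selection intensity when the top fraction p is saved.

    Assumes a normally distributed trait: i = phi(z) / p, where z is the
    unit-normal truncation point with upper-tail area p and phi is the
    unit-normal density (its height at the threshold).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    unit_normal = NormalDist()                # mean 0, standard deviation 1
    threshold = unit_normal.inv_cdf(1.0 - p)  # same role as qnorm(1 - p)
    return unit_normal.pdf(threshold) / p     # same role as dnorm(...) / p


for p in (0.5, 0.2, 0.1, 0.05, 0.01, 0.005):
    print(f"p = {p}: i = {selection_intensity(p):.3f}")
```

The printed values should agree with the p versus i table above to three decimal places.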
Selection Intensity Versions of the Breeder’s Equation
$R = h^2 S = h^2 \frac{S}{\sigma_P} \sigma_P = i \, h^2 \sigma_P$
Since $h^2 \sigma_P = (\sigma_A^2 / \sigma_P^2) \, \sigma_P = \sigma_A (\sigma_A / \sigma_P) = h \, \sigma_A$,
$R = i \, h \, \sigma_A$
Since h is the correlation between phenotypic and breeding values, $h = r_{PA}$, so
$R = i \, r_{PA} \, \sigma_A$
Response = intensity * accuracy * spread in breeding values ($\sigma_A$)
When we select an individual solely on their phenotype, the accuracy (correlation) between BV and phenotype is h
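A quick numerical check that the two forms of the breeder's equation agree, written as a Python sketch with hypothetical variance components:

```python
from math import sqrt


def response_h2_s(var_additive: float, var_phenotypic: float, s: float) -> float:
    """R = h^2 * S, with h^2 = sigma^2_A / sigma^2_P."""
    return (var_additive / var_phenotypic) * s


def response_intensity_form(i: float, var_additive: float, var_phenotypic: float) -> float:
    """R = i * h * sigma_A, with h acting as the accuracy of phenotypic selection."""
    h = sqrt(var_additive / var_phenotypic)
    sigma_a = sqrt(var_additive)
    return i * h * sigma_a


# Hypothetical values: sigma^2_A = 25, sigma^2_P = 100, top 20% saved (i = 1.4).
VAR_A, VAR_P, INTENSITY = 25.0, 100.0, 1.4
s = INTENSITY * sqrt(VAR_P)  # S = i * sigma_P

r1 = response_h2_s(VAR_A, VAR_P, s)
r2 = response_intensity_form(INTENSITY, VAR_A, VAR_P)
print(r1, r2)  # the two forms give the same response
```

Writing the response as intensity times accuracy times the spread in breeding values makes it easy to swap in a different accuracy when selection is based on something other than the individual's own phenotype.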
Accuracy of selection
More generally, we can express the breeder’s equation as
$R = i \, r_{uA} \, \sigma_A$
Here we select individuals based on the index u (for example, the mean of n of their sibs).
$r_{uA}$ = the accuracy of using the measure u to predict an individual's breeding value = the correlation between u and an individual's BV, A
Improving accuracy
• Predicting either the breeding or genotypic value from a single individual often has low accuracy: h^2 and/or H^2 (based on a single individual) is small
– Especially true for many plant traits with high G x E
– Need to replicate either clones or relatives (such as sibs) over regions and years to reduce the impact of G x E
– Likewise, information from a set of relatives can give much higher accuracy than the measurement of a single individual
Stratified mass selection
• In order to accommodate the high environmental variance of individual plant values, Gardner (1961) proposed the method of stratified mass selection
– The population is stratified into a number of different blocks (i.e., sections within a field)
– The best fraction p within each block is chosen
– The idea is that environmental values are more similar among individuals within each block, increasing trait heritability.
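The within-block choice can be sketched in Python as below. The data layout (tuples of plant ID, block, and phenotype), the function name, and the rule of keeping at least one plant per block are assumptions made for illustration, not part of Gardner's description.

```python
from math import ceil


def stratified_mass_selection(plants, p):
    """Select the best fraction p of plants within each block.

    plants: iterable of (plant_id, block, phenotype) tuples.
    Returns the IDs of the selected plants, block by block.
    """
    by_block = {}
    for plant_id, block, value in plants:
        by_block.setdefault(block, []).append((value, plant_id))

    selected = []
    for block in sorted(by_block):
        ranked = sorted(by_block[block], reverse=True)  # best phenotype first
        n_keep = max(1, ceil(p * len(ranked)))          # assumption: keep >= 1 per block
        selected.extend(plant_id for _, plant_id in ranked[:n_keep])
    return selected


# Hypothetical field: the "high" block has a better environment, so every
# plant in it outscores every plant in the "low" block.
plants = [
    ("l1", "low", 10.0), ("l2", "low", 11.0), ("l3", "low", 12.0), ("l4", "low", 13.0),
    ("h1", "high", 20.0), ("h2", "high", 21.0), ("h3", "high", 22.0), ("h4", "high", 23.0),
]

selected = stratified_mass_selection(plants, p=0.25)
print(selected)  # one plant from each block
```

Unstratified mass selection on the same data would take both of its top 25% from the "high" block, so much of the apparent superiority would be environmental.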
Family selection
• Low heritabilities of traits are a major issue
– High G x E, esp. year-to-year
– Single phenotypes are very poor predictors of A, G.
– Hence, often grow out relatives in field trials (multiple plots over a range of regions and years, giving better sampling of G x E)
• Within- versus between-family selection
• Response under general family selection
• Lush’s family index
Different types of family-based selection
Uppermost fraction p chosen, m families each with n sibs, total N = mn
Within- vs. Between-family selection
Between-family response
Within-family response
Within, Between or Individual selection?
Which scheme is best depends on the trait heritability and the intraclass correlation t among sibs
Between-family response > individual when
Low heritability, small common-family variance
Within-family response > individual when
Requires low heritability and a large c^2. The between-family common variance, c^2 Var(z), then accounts for much of the total trait variance
General response
• The basic idea is that a parent P generates a set of relatives x1 through xn whose phenotypes are measured (selection unit or group)
• Based on the performance of the selection group, relatives Ri of the best parents are chosen to represent Pi in forming the next generation (recombination unit or group)
– R can be
  • The parent itself
  • Progeny (measured or unmeasured) of the parent; these could be half- or full-sibs to the selection group, or the seed from selfing the parent
  • Other relatives of P
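[Figure: parents P1 and P2, their selection units x1 and x2, the recombination-unit relatives R1 and R2, and the offspring y. Labels: the selection units choose the parents; R1 and R2 are crossed to make offspring; y gives the response in the next generation.]
Key: The covariance σ(xi, y) between a member of the selection group and the offspring is critical to predicting selection response. It is closely related to σ(xi, A_Ri), the covariance between the selection unit and the BV of R
Expected response is the average of the breeding values of the Ri.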
Response under general family selection
• Recall the accuracy version of the breeder’s equation, $R = i \, r_{uA} \, \sigma_A$
• In our context,
– i is the selection intensity between selection units
– u is the value of the selection unit
– A is the breeding value of the corresponding member R of the recombination unit, and $\sigma_A^2$ is the additive variance in the recombination unit
• The correlation $r_{uA}$ is obtained from standard resemblance between relatives calculations (full details in WL Chapter 19)
Offspring - selection unit covariances
Offspring - selection unit covariances (cont)
Variance in the selection unit
Response
Specific schemes: ear-to-row
• A common scheme in corn breeding is to plant the seeds from an ear as rows
– Each row is thus a half-sib family (this is the selection unit)
– Some seed from the ear is saved (these form the recombination unit)
– Suppose a total of N seeds per ear are grown as np rows of ns sibs in ne environments
Family x E interaction
Modified ear-to-row
• Lonnquist (1964) proposed modified ear-to-row
• Combines ear-to-row (between-family selection) with selection within row (within-family selection)
• Plant the seeds from the ear into two sets of rows. One set is several rows over multiple environments. Select the best ears based on this performance
• Grow out residual seed from these best ears in a single row, then select the best from each family within each row
[Figure: modified ear-to-row scheme, with the best ears chosen at each of the two steps]
Response
Total R = R from ear-to-row + R from within-row
Lush’s family index
Finally, Lush suggested that an index weighting both within- and between-family values is optimal
The optimal weights are given by
Ratio of index response to individual-selection response > 1
[Figure: Half-sibs. Ratio of index weights b2/b1 plotted against the intraclass correlation t for family sizes n = 2, 4, 10, 25, 50. Annotated regions: more emphasis on within-family deviations; more emphasis on between-family deviations.]
[Figure: Full-sibs. Ratio of index weights b2/b1 plotted against the intraclass correlation t for family sizes n = 2, 4, 10, 25, 50. Annotated regions: more emphasis on within-family deviations; more emphasis on between-family deviations.]
Permanent Versus Transient Response
Considering epistasis and shared environmental values, the single-generation response follows from the midparent-offspring regression
Permanent component of response
Transient component of response: contributes to short-term response, but decays away to zero over the long term
Permanent Versus Transient Response
The reason for the focus on h^2 S is that this component is permanent in a random-mating population, while the other components are transient, initially contributing to response, but this contribution decays away under random mating
Why? Under HW, changes in allele frequencies are permanent (they don’t decay under random mating), while LD (and hence the epistatic contribution) does decay, and environmental values also become randomized
Response with Epistasis
c is the average (pairwise) recombination between loci involved in A x A
Response from additive effects (h^2 S) is due to changes in allele frequencies and hence is permanent. The contribution from A x A is due to linkage disequilibrium
Contribution to response from epistasis decays to zero as linkage disequilibrium decays to zero
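The decay of the A x A contribution can be illustrated with a short Python sketch. It assumes the standard single-generation result for additive-by-additive epistasis, R = S [h^2 + σ²_AA / (2 σ²_z)], with the epistatic term scaled by (1 - c)^τ after τ further generations of random mating. That formula is not written out in the text above and is an assumption here (see WL for the derivation); all numbers are hypothetical.

```python
def response_components(s, var_a, var_aa, var_z, c, tau=0):
    """Split the response into a permanent and a transient (A x A) part.

    Assumed model: permanent = S * sigma^2_A / sigma^2_z (that is, h^2 S);
    transient = S * (1 - c)^tau * sigma^2_AA / (2 * sigma^2_z),
    where c is the average pairwise recombination rate and tau is the
    number of generations of random mating after the generation of selection.
    """
    permanent = s * var_a / var_z
    transient = s * (1.0 - c) ** tau * var_aa / (2.0 * var_z)
    return permanent, transient


# Hypothetical values: S = 10, sigma^2_A = 30, sigma^2_AA = 20, sigma^2_z = 100,
# unlinked loci (c = 0.5).
for tau in range(4):
    permanent, transient = response_components(10.0, 30.0, 20.0, 100.0, c=0.5, tau=tau)
    print(f"tau = {tau}: permanent = {permanent:.3f}, transient = {transient:.3f}")
```

With unlinked loci the transient part halves each generation, while tighter linkage (smaller c) makes it persist longer.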
Why an unselected base population? If there is a history of previous selection, linkage disequilibrium may be present and the mean can change as the disequilibrium decays
More generally, for t generations of selection followed by τ generations of no selection (but recombination)
R_AA has a limiting value given by
Time to equilibrium is a function of c
What about response with higher-order epistasis?
Fixed incremental difference that decays when selection stops
Response in autotetraploids
• Autotetraploids pass along two alleles at each locus to their offspring
• Hence, dominance variance is passed along
• However, as with A x A, this depends upon favorable combinations of alleles, and these are randomized over time by transmission, so the D component of response is transient.
Autotetraploids
P-O covariance
Single-generation response
Response to t generations of selection with constant selection differential S
Response remaining after t generations of selection followed by τ generations of random mating
Contribution from dominance quickly decays to zero
General responses
• As we have seen with both individual and family selection, the response can be thought of as a regression of either the offspring value (y) or the breeding value R_A of the individual in the recombination group on some phenotypic measurement x (such as the individual itself or its corresponding selection unit value)
• The regression slope is σ(x, y)/σ^2(x) for predicting y from x, and σ(x, R_A)/σ^2(x) for predicting the BV R_A from x
• With transient components of response, these covariances now also become functions of time, e.g. the covariance between x in one generation and y several generations later
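A minimal Python sketch of response as a regression: estimate the slope σ(x, y)/σ^2(x) from paired values and multiply it by the selection differential on x. The data are hypothetical and the function names are illustrative.

```python
from statistics import mean


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def regression_slope(xs, ys):
    """Slope for predicting y from x: cov(x, y) / var(x)."""
    return sample_covariance(xs, ys) / sample_covariance(xs, xs)


def predicted_response(xs, ys, selected_mean):
    """Slope times the selection differential on x (selected mean minus overall mean)."""
    return regression_slope(xs, ys) * (selected_mean - mean(xs))


# Hypothetical selection-unit values x and offspring values y.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [2.5, 3.0, 3.5, 4.0, 4.5]

slope = regression_slope(x_values, y_values)
resp = predicted_response(x_values, y_values, selected_mean=4.0)
print(slope, resp)
```

The same function applies whether y holds offspring values or breeding values of the recombination-unit members; only the covariance being estimated changes.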
Ancestral Regressions
When regressions on relatives are linear, we can think of the response as the sum over all previous contributions
For example, consider the response after 3 generations:
$R(3) = 2 \beta_{3,2} S_2 + 4 \beta_{3,1} S_1 + 8 \beta_{3,0} S_0$
– 8 great-grandparents: $S_0$ is their selection differential, and $\beta_{3,0}$ is the regression coefficient for an offspring at time 3 on a great-grandparent from time 0
– 4 grandparents: selection differential $S_1$; $\beta_{3,1}$ is the regression of a relative in generation 3 on their generation 1 relatives
– 2 parents: selection differential $S_2$, regression coefficient $\beta_{3,2}$
Ancestral Regressions
More generally,
$R(T) = \sum_{t=0}^{T-1} 2^{T-t} \beta_{T,t} S_t$, where $\beta_{T,t} = \mathrm{cov}(z_T, z_t) / \sigma^2(z_t)$
The general expression cov(z_T, z_t), where we keep track of the actual generations (as opposed to cov(z, z_{T-t}), which tracks only how many generations separate the relatives), allows us to handle inbreeding, where the (say) P-O regression slope changes over generations of inbreeding.
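The sum over ancestral contributions can be sketched in Python. The sketch assumes the cumulative response takes the form R(T) = Σ_t 2^(T-t) β_{T,t} S_t, consistent with the counts of 2 parents, 4 grandparents, and 8 great-grandparents in the three-generation example; the regression coefficients used in the check are for a purely additive trait, where β_{T,t} = h^2 / 2^(T-t). All numbers are hypothetical.

```python
def cumulative_response(betas, differentials):
    """Sum ancestral contributions: R(T) = sum_t 2^(T - t) * beta[t] * S[t].

    betas[t] is the regression of a generation-T individual on a single
    generation-t ancestor; differentials[t] is the selection differential
    applied in generation t, for t = 0, ..., T - 1.
    """
    if len(betas) != len(differentials):
        raise ValueError("need one regression coefficient per generation of selection")
    total_generations = len(differentials)
    return sum(
        2 ** (total_generations - t) * beta * s
        for t, (beta, s) in enumerate(zip(betas, differentials))
    )


# Hypothetical additive trait with h^2 = 0.4 and S = 1 in each of 3 generations.
H2 = 0.4
T = 3
betas = [H2 / 2 ** (T - t) for t in range(T)]  # great-grandparent, grandparent, parent
differentials = [1.0, 1.0, 1.0]

total = cumulative_response(betas, differentials)
print(total)  # equals T * h^2 * S for a purely additive trait
```

In the additive case each generation of selection contributes h^2 S, so the sum reduces to the breeder's equation applied T times; with transient components the β values would shrink faster with T - t.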
Selfing
• Finally, let’s consider selection when an F1 is formed and then selfed for several generations to generate inbred lines
• How best to advance lines to full inbreeding while selecting?
– Should advancement occur first, and then we can choose among the resulting inbred lines?
– Or should some selection (early testing) occur while the lines are being inbred?
• First, we will consider the response with simultaneous selfing and selection
– Inbreeding removes within-line variation and enhances between-line variation
Details for computing these genetic covariances are given in Walsh & Lynch Chapter 20 (online)
Selection of the best pure lines
• A very common setting in plant breeding is when two (or more) inbreds are crossed and the resulting F1 is continually selfed to form a series of inbred lines
• This is different from selecting elite lines among a set of already inbred lines, as the breeder also has to advance the lines to full inbreds, which often takes time, in addition to trying to select the best ones.
– The most accurate measures of performance (given G x E) are multiregional trials, wherein a subset of the advancing lines is measured over a series of regions and years.
Advancing to full inbreds
• How best to combine advancing a line to being fully inbred while still selecting (testing) them?
– Tradeoff between less accurate testing of a large number of lines (but more variation kept) vs. more accurate testing of a smaller number of lines (representing less variation)
• Methods
– Single seed descent (SSD)
– Doubled haploids (DH)
– Bulk selection
– Pedigree selection
SSD, DH selection
• Under SSD (single-seed descent), single seeds are used to advance a series of lines to full inbreeding, then selection (choosing among them) occurs, usually through multiregional trials
• Single seeds are used to reduce any effects of selection during inbreeding
• Under DH (doubled haploids), inbred lines are formed in one generation
– Less chance for decay of any LD relative to SSD, but the effect is likely to be small
Bulk Selection
• Seeds from natural selfers are grown and harvested in bulk over multiple generations
– One problem is that natural selection during the advancing of generations selects on yield (leaving more descendants) but also on other traits.
– Often tall plants are naturally selected during the bulk over higher yielding short plants
Bulk selection response
Atlas wins, but Vaughn best from an agricultural standpoint (higher yield -- 107% of Atlas, earlier heading date, better disease resistance)
Pedigree selection
• Not to be confused with pedigree-based selection using BLUP (which very formally uses information from all relatives in selection decisions).
• Under pedigree selection (aka pedigree breeding), as individuals become more inbred, selection decisions shift from individual plants towards family-based performance
• High heritability traits are selected early (individual selection); lower heritability traits (e.g., those with high G x E) are selected later (family selection allows for replication over different environments)
[Figure: Pedigree selection scheme, showing the generations under individual selection and the generations under family-based selection]
Early generation testing
• Much debate on the effectiveness of early generation testing
• Effectiveness depends on a high correlation between phenotypes in the tested generations and the final genotypes of their selfed descendants
– Concerns: genotypes selected early can be different from their fully inbred descendants
– Basic idea: OK for high-heritability traits
– Not so good for low-heritability traits
Effectiveness of different methods
EGYT = early generation yield traits, BS = Bulk selection, DH = Doubled haploids, SSD = single seed descent, PS = Pedigree selection