
6.047/6.878 Lecture 23: Population Genetic Variation

Guest Lecture by Pardis Sabeti

Scribed by Mohammad Ghassemi, Jonas Helfer, Ben Mayne (2012), Alex McCAuley (2010), Matthew Lee (2009), Arjun K. Manrai and Clara Chan (2008)

December 13, 2012


Contents

1 Introduction

2 Population Selection Basics
    2.1 Polymorphisms
    2.2 Allele and Genotype Frequencies
        2.2.1 Hardy-Weinberg Principle
        2.2.2 Wright-Fisher Model
    2.3 Ancestral State of Polymorphisms
    2.4 Measuring Derived Allele Frequencies

3 Genetic Linkage
    3.1 Correlation Coefficient r2

4 Natural Selection
    4.1 Genomics Signals of Natural Selection
        4.1.1 Examples of Negative (Purifying) Selection
        4.1.2 Examples of Positive (Adaptive) Selection
        4.1.3 Statistical Tests

5 Human Evolution
    5.1 A History of the Study of Population Dynamics
    5.2 Understanding Disease
    5.3 Understanding Recent Population Admixture

6 Current Research
    6.1 HapMap project
    6.2 1000 genomes project

7 Further Reading


List of Figures

1 Plot of genotype frequencies for different allele frequencies
2 Changes in allele frequency over time
3 A comparison of the heterozygous and homozygous derived and damaging genotypes per individual in an African American (AA) and European American (EA) population study
4 Two isolated populations
5 Approximate Time Table of Effects (Sabeti et al., Science 2006)
6 Localized positive selection for Malaria resistance within species (Sabeti et al., Science 2006)
7 Localized positive selection for lactase persistence allele (Sabeti et al., Science 2006)
8 Mean allele frequency difference of height SNPs, matched SNPs, and genome-wide SNPs between Northern- and Southern-European populations (Turchin et al., Nature Genetics 2012)
9 Broken haplotype as a signal of natural selection
10 A depiction of two major bottleneck events, one in the founding population from Africa, and other, smaller subsequent bottleneck events in the East Asian and Western European populations
11 An illustration of two bottleneck events
12 An illustration of the effects of bottleneck events on the number of rare alleles in a population
13 A depiction of European admixture levels in the Mexican and African American populations
14 An illustration of the magnitude and origin of migrants based on the tract length and number of tracts in the admixed population

1 Introduction

For centuries, biologists had to rely on morphological and phenotypic properties of organisms to infer the tree of life and make educated guesses about the evolutionary history of species. Only recently has the ability to cheaply sequence entire genomes and find patterns in them transformed evolutionary biology. Sequencing and comparing genomes at the molecular level has become a fundamental tool that allows us to gain insight into much older evolutionary history than before, and also to understand evolution at a much finer resolution of time. With these new tools, we can not only learn the relationships between distant clades that separated billions of years ago, but also understand the present and recent past of species and even of different populations within a species.

In this chapter we discuss the study of human genetic history and recent selection. The methodological framework of this section builds largely on concepts from previous chapters: most specifically, the methods for association mapping of disease, phylogenetic constructs such as tree building among species and genes, and the history of mutations using coalescence. Having learned about these methods in the last chapter, we now study how their application can inform us about the relationships and differences between human populations. Additionally, we look at how these differences can be exploited to find signals of recent natural selection and to identify disease loci. We also discuss what we currently know about the differences between human populations and describe some parameters that quantify population differences and that can be inferred using only the extant genetic variation we observe.

In the study of human genetic history and recent selection, there are two principal topics of investigation. The first is the history of population sizes. The second is the history of interactions between populations. Questions are often asked about these areas because the answers can improve the disease mapping process. Thus far, all present research-based knowledge of human history has been found by investigating functionally neutral regions of the genome and assuming genetic drift. Neutral regions are used because mutations in a functional region are subject to positive, negative, and balancing selection pressure. Hence, investigating neutral regions provides a selection-unbiased proxy for the drift between species.

In this chapter we delve into some of the characteristics of the selection process in humans and look for patterns of human variation in terms of cross-species comparisons, comparisons of synonymous and non-synonymous mutations, and haplotype structure.


2 Population Selection Basics

2.1 Polymorphisms

Polymorphisms are differences in appearance among members of the same species. Many of them arise from mutations in the genome. These mutations, or genetic polymorphisms, can be classified into different types.

Single Nucleotide Polymorphisms (SNPs)

• The mutation of only a single nucleotide base within a sequence. In most cases, these changes are without consequence. However, there are some cases where the mutation of a single nucleotide has a major effect.

• For example, sickle-cell anemia is caused by a mutation from A to T that changes glutamic acid (GAG) to valine (GTG) in hemoglobin.

Variable Number Tandem Repeats

• When a short sequence is repeated multiple times, DNA polymerase can sometimes "slip", causing it to make either too many or too few copies of the repeat. This is called a variable number tandem repeat (VNTR).

• For example, Huntington's disease is caused by too many repeats of the trinucleotide CAG in the HTT gene. Having more than 36 repeats can lead to gradual loss of muscle control and severe neurological degradation. Generally, the more repeats there are, the stronger the symptoms.

Insertion/Deletion

• Through faulty copying or DNA repair, insertions or deletions of one or multiple nucleotides can occur.

• If the insertion or deletion is inside an exon (the protein-coding region of a gene) and does not consist of a multiple of three nucleotides, a frameshift will occur.

• A prime example is deletions in the CFTR gene, which codes for chloride channels in the lungs. Such deletions may cause cystic fibrosis, in which the patient cannot clear mucus from the lungs, leading to infection.

Did You Know?

DNA profiling is based on short variable number tandem repeats (STRs). DNA is cut with certain restriction enzymes, resulting in fragments of variable length that can be used to identify an individual. Different countries use different (but often overlapping) loci for these profiles. In North America, a system based on 13 loci is used.

2.2 Allele and Genotype Frequencies

In order to understand the evolution of a species through analysis of alleles or genotypes, we must have a model of how alleles are passed on from one generation to the next. It is of immense importance that the reader has a firm intuition for the Hardy-Weinberg principle and the Wright-Fisher model before continuing. Hence, we provide here a short reminder of how the history of mutations is modelled via these methods. First introduced over a hundred years ago, the Wright-Fisher model is a mathematical model of genetic drift in a population. Specifically, it describes the probability of obtaining k copies of a new allele with frequency p within a population of size N, with a non-mutant frequency of q, and what its expected frequency will be in successive generations.


2.2.1 Hardy-Weinberg Principle

The Hardy-Weinberg principle states that allele and genotype frequencies within a population will remain constant unless there is an outside influence that pushes them away from that equilibrium.

The Hardy-Weinberg principle is based on the following assumptions:

• The population observed is very large

• The population is isolated, i.e. there is no introduction of another subpopulation into the general population

• All individuals have equal probability of producing offspring

• All mating in the population is at random

• No random mutations occur in the population from one generation to the next

• Allele frequency drives future genotype frequency (Prevalent allele drives Prevalent genotype)

In Hardy-Weinberg equilibrium, for two alleles A and a occurring with probabilities $p$ and $q = 1 - p$, respectively, the probabilities of a randomly chosen individual having the homozygous genotype AA or aa ($p^2$ or $q^2$, respectively) or the heterozygous genotype Aa or aA ($2pq$) are described by the equation:

$$p^2 + 2pq + q^2 = 1$$

This equation gives a table of probabilities for each genotype, which can be compared with the observed genotype frequencies using statistical tests such as the chi-squared test to determine whether the Hardy-Weinberg model is applicable. Figure 1 shows the distribution of genotype frequencies at different allele frequencies.

Figure 1: Plot of genotype frequencies for different allele frequencies
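The comparison between expected and observed genotype frequencies can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: the function names and the genotype counts in the example are made up, and the allele frequency is simply estimated from the observed sample.

```python
def hardy_weinberg_expected(p):
    """Expected genotype frequencies (AA, Aa, aa) for allele frequency p of A."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def hwe_chi_squared(n_AA, n_Aa, n_aa):
    """Chi-squared statistic comparing observed genotype counts with the
    counts expected under Hardy-Weinberg equilibrium."""
    n = n_AA + n_Aa + n_aa
    # Allele frequency of A estimated from the sample (2 copies per AA, 1 per Aa).
    p = (2 * n_AA + n_Aa) / (2 * n)
    expected = [f * n for f in hardy_weinberg_expected(p)]
    observed = [n_AA, n_Aa, n_aa]
    # Skip classes with zero expected count to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    # Hypothetical genotype counts, for illustration only.
    print(hardy_weinberg_expected(0.3))
    print(hwe_chi_squared(n_AA=298, n_Aa=489, n_aa=213))
```

With two alleles the test has one degree of freedom, so a statistic above roughly 3.84 would reject Hardy-Weinberg equilibrium at the 5% level.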

In natural populations, the assumptions made by the Hardy-Weinberg principle will rarely hold. Natural selection occurs, small populations undergo genetic drift, populations are split or merged, etc. In nature, a mutation will eventually either disappear from the population (frequency = 0) or become prevalent throughout the species (frequency = 1); the latter is called fixation. In general, 99% of mutations disappear. Figure 2 shows a simulation of the prevalence of two mutations in a finite-sized population over time: both perform random walks, with one mutation disappearing and the other becoming prevalent.

Once a mutation has disappeared, the only way for it to reappear is the introduction of a new mutation into the population. For humans, it is believed that a given mutation under no selective pressure should drift to a frequency of 0 or 1 (within, e.g., 5%) within a few million years. However, under selection this will happen much faster.


Figure 2: Changes in allele frequency over time

2.2.2 Wright-Fisher Model

Under this model, for a new neutral mutation in a population of size N, the expected time to fixation is 4N generations and the probability of fixation is 1/(2N). In general, Wright-Fisher is used to answer questions related to fixation in one way or another. To make sure your intuitions about the method are absolutely clear, consider the following questions:
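The 1/(2N) fixation probability can be checked empirically with a small simulation. The sketch below is an illustration under stated assumptions, not the lecture's code: it assumes a diploid population of constant size N (so 2N gene copies), no selection, and binomial resampling of gene copies each generation; the function names are hypothetical.

```python
import random


def wright_fisher_fixation(n_diploid, start_copies=1, rng=None):
    """Simulate neutral drift until the allele is lost or fixed.

    Each generation, the 2N gene copies are drawn with replacement from the
    previous generation (binomial sampling). Returns (fixed, generations).
    """
    rng = rng or random.Random()
    total = 2 * n_diploid
    copies = start_copies
    generations = 0
    while 0 < copies < total:
        p = copies / total
        # One Binomial(2N, p) draw, written as 2N Bernoulli trials.
        copies = sum(1 for _ in range(total) if rng.random() < p)
        generations += 1
    return copies == total, generations


def estimate_fixation_probability(n_diploid, trials, seed=0):
    """Fraction of simulated new mutations (one starting copy) that fix."""
    rng = random.Random(seed)
    fixed = sum(wright_fisher_fixation(n_diploid, rng=rng)[0] for _ in range(trials))
    return fixed / trials


if __name__ == "__main__":
    n = 10
    print("simulated:", estimate_fixation_probability(n, trials=4000))
    print("expected 1/(2N):", 1 / (2 * n))
```

For a small population the simulated fraction should land close to 1/(2N); most runs end quickly with the new allele being lost.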

FAQ

Q: Say you have a total of 5 mutations on a chromosome among a population of size 30. On average, how many mutations will be present in the next generation if each entity produces only one child?

A: If each parent has only one offspring, then there will be, on average, 5 mutations in the next generation, because the expectation of allele frequencies is to remain constant according to the Hardy-Weinberg equilibrium principle in basic biology.

FAQ

Q: Is the Hardy-Weinberg equilibrium principle's assumption about constant allele frequency reasonable?

A: No, the reality is far more complex, as there is stochasticity in population size and selection at each generation. A more appropriate way to envision this is to imagine drawing alleles from a set of parents, with the number of alleles in the next generation varying with the size of the population. Hence the frequency in the next generation could very well go up or down. Note that if the allele frequency goes to zero it will stay at zero. The probability of the allele appearing in each successive generation is lower if it is under negative selection and higher if it is under positive selection. Hence if the mutation is beneficial the fixation time will be shorter, and if the mutation is deleterious the fixation time will be longer. If there are no offspring with a given mutation, then there won't be any descendants with that mutation either. If one produces multiple offspring, however, who in turn produce multiple offspring of their own, then there is a greater chance that this allele frequency will rise.


FAQ

Q: Consider that the average human individual carries roughly 100 entirely unique mutations. So, when an individual produces offspring we could expect that half (or 50) of those mutations may appear in the child because in each sperm or egg cell, 50 of those mutations will be present, on average. Hence the offspring of an individual are likely to inherit approximately 100 mutations, 50 from one parent and 50 from another, in addition to their own unique mutations which come from neither parent. With this in mind, one might be interested in understanding what the chances are of some mutations appearing in the next generation if an individual produces, say, n children. How can one do this?

A: Hint: To compute this value, we assume that some allele originates in the founder, at some arbitrary chromosome (1 for example). Then we ask the question, how many chromosome 1s exist in the entire population? At the moment, the size of the human population is 7 billion, each carrying two copies of chromosome 1.
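One way to start on the hint can be sketched as follows. This is an illustration only, under two assumptions that are not spelled out in the notes: each child independently inherits a given heterozygous variant with probability 1/2, and a brand-new variant exists as a single copy among the 2N copies of its chromosome. The function names are hypothetical.

```python
def prob_transmitted(n_children):
    """Probability that a given heterozygous variant reaches at least one
    of n children, assuming independent 50/50 inheritance per child."""
    return 1.0 - 0.5 ** n_children


def initial_frequency(population_size):
    """Frequency of a brand-new variant: one copy among 2N chromosomes."""
    return 1.0 / (2 * population_size)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, "children:", prob_transmitted(n))
    # Population size of 7 billion, as quoted in the hint above.
    print("starting frequency:", initial_frequency(7_000_000_000))
```

Under these assumptions the chance that a variant survives one generation rises quickly with family size, while its starting frequency in the whole population is tiny, which is why most new mutations are lost.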

The above questions and answers should make it painfully clear that the standard Hardy-Weinberg assumption of allele frequencies remaining constant from one generation to the next is violated in many natural cases, including migration, genetic mutation, and selection. In the case of selection, this issue is addressed by modifying the formal definition to include a term s, which measures the skew in genotypes due to selection. See Table 1 for a comparison of the original and selection-compensated versions:

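| Behavior | With only drift | With drift and selection |
|---|---|---|
| n in next generation | Mean: $n\ (= 2Np)$; Dist: $\mathrm{Binomial}(2N,\ p)$ | Mean: $n\,\frac{1+s}{1+ps}$; Dist: $\mathrm{Binomial}\left(2N,\ p\,\frac{1+s}{1+ps}\right)$ |
| Time to fixation | $4N$ | $\dfrac{4N}{1 + \frac{3}{8}N\lvert s\rvert\left(1 + \frac{1}{2}\,\frac{(\ln N)\,\lvert s\rvert}{1+\lvert s\rvert}\right)}$ |
| Probability of fixation | $\dfrac{1}{2N}$ | $\dfrac{1-e^{-2s}}{1-e^{-4Ns}}$ |

Table 1: Comparison of Wright-Fisher Model With Drift, Versus Drift and Selection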

The main point to take away from Table 1, and from this section of the chapter, is that whether or not there is selection, it is highly unlikely that a single allele will fixate in a population. If you have a very small population, however, the chances of an allele fixating are much better. This is often the case in human populations, where small, interbred populations allow mutations to fix after only a few generations, even if the mutation is deleterious in nature. This is precisely why we tend to see recessive deleterious Mendelian disorders in isolated populations.
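To get a feel for the numbers, the fixation probabilities in the last row of Table 1 can be evaluated directly. The sketch below is illustrative rather than authoritative: it takes the drift-only probability as 1/(2N) and the drift-plus-selection probability as (1 - e^(-2s)) / (1 - e^(-4Ns)), and the function name and example values of N and s are arbitrary.

```python
import math


def fixation_probability(n_diploid, s=0.0):
    """Fixation probability of a single new copy in a population of size N.

    s = 0 gives the neutral value 1/(2N); otherwise the selection formula
    (1 - exp(-2s)) / (1 - exp(-4Ns)) is used.
    """
    if s == 0.0:
        return 1.0 / (2 * n_diploid)
    exponent = -4.0 * n_diploid * s
    if exponent > 700.0:
        # exp() would overflow for strongly deleterious alleles in large
        # populations; the probability is effectively zero there.
        return 0.0
    # expm1(x) = exp(x) - 1; the two sign flips cancel in the ratio.
    return math.expm1(-2.0 * s) / math.expm1(exponent)


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, fixation_probability(n),
              fixation_probability(n, s=0.01),
              fixation_probability(n, s=-0.01))
```

The neutral probability shrinks in proportion to population size, while a beneficial allele keeps a fixation probability of roughly 2s even in large populations, and a deleterious allele has a realistic chance only when the population is small.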

2.3 Ancestral State of Polymorphisms

How can we determine, for a given polymorphism, which version was the ancestral state and which one is the mutant? The ancestral state can be inferred by comparing the genome to that of a closely related species (e.g. humans and chimpanzees) with a known phylogenetic tree. Mutations can occur anywhere along the phylogenetic tree. Sometimes mutations at the split fix differently in different populations (“fixed difference”), in which case the entire populations differ in genotype. However, recent mutations will not have had enough time to become fixed, and a polymorphism will be present in one species but fully absent in the other, as simultaneous mutations in both species are very rare. In this case, the “derived variant” is the version of the polymorphism appearing after the split, while the ancestral variant is the version occurring in both species.

2.4 Measuring Derived Allele Frequencies

The frequency of the derived allele in the population can be easily calculated if we assume that the population is homogeneous. However, this assumption may not hold when there is an unseen divide between two groups that causes them to evolve separately, as shown in Figure 4.

Figure 3: A comparison of the heterozygous and homozygous derived and damaging genotypes per individual in an African American (AA) and European American (EA) population study.

Figure 4: Two isolated populations

In this case the prevalence of the variants differs among subpopulations and the Hardy-Weinberg principle is violated.

One way to quantify this difference is to use the fixation index (Fst) to compare subpopulations within a species. In reality, only a portion of the total heterozygosity in a species is found in a given subpopulation. Fst estimates the reduction in heterozygosity (2pq for two alleles with frequencies p and q) expected when two different populations are erroneously grouped together. Given a population having n alleles with frequencies $p_i$ ($1 \le i \le n$), the homozygosity G of the population is calculated as:

$$G = \sum_{i=1}^{n} p_i^2$$

The total heterozygosity in the population is given by $1 - G$.

$$F_{st} = \frac{\text{Heterozygosity(total)} - \text{Heterozygosity(subpopulation)}}{\text{Heterozygosity(total)}}$$


In the case shown in Figure 4 there is no heterozygosity within either subpopulation, so Fst = 1. In reality the Fst will be small within one species. In humans, for example, it is only 0.0625. In practice, Fst is computed either by clustering subpopulations randomly or by using an obvious characteristic such as ethnicity or origin.
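The Fst calculation can be sketched as below. This is a minimal sketch under assumptions that go beyond the notes: subpopulations are taken to be of equal size, the total heterozygosity is computed from the averaged (pooled) allele frequencies, and the subpopulation heterozygosity is the average over subpopulations. The function names are hypothetical.

```python
def heterozygosity(freqs):
    """1 - G, where G is the sum of squared allele frequencies (homozygosity)."""
    return 1.0 - sum(p * p for p in freqs)


def fst(subpop_freqs):
    """Fst from a list of per-subpopulation allele-frequency lists.

    Assumes equally sized subpopulations, so pooled frequencies are plain averages.
    """
    k = len(subpop_freqs)
    n_alleles = len(subpop_freqs[0])
    pooled = [sum(pop[i] for pop in subpop_freqs) / k for i in range(n_alleles)]
    h_total = heterozygosity(pooled)
    h_sub = sum(heterozygosity(pop) for pop in subpop_freqs) / k
    if h_total == 0.0:
        return 0.0  # no variation at all, so no differentiation to measure
    return (h_total - h_sub) / h_total


if __name__ == "__main__":
    # Two isolated populations fixed for different alleles.
    print(fst([[1.0, 0.0], [0.0, 1.0]]))
    # Two populations with mildly different allele frequencies.
    print(fst([[0.6, 0.4], [0.4, 0.6]]))
```

Two populations fixed for different alleles give the maximal value, while populations with identical allele frequencies give zero.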

3 Genetic Linkage

In the simple models we've seen so far, alleles are assumed to be passed on independently of each other. While this assumption generally holds in the long term, in the short term we will generally observe that certain alleles are passed on together more frequently than expected. This is termed genetic linkage.

The law of independent assortment, also known as Mendel's second law, states:

Alleles of different genes are passed on independently from parent to offspring.

When this “law” holds, there is no correlation between different polymorphisms, and the probability of a haplotype (a given set of polymorphisms) is simply the product of the probabilities of each individual polymorphism.

In the case where the two genes lie on different chromosomes this assumption of independence generally holds, but if the two genes lie on the same chromosome, they are more often than not passed on together. Without genetic recombination events, in which segments of DNA on homologous chromosomes are swapped (crossing-over), the alleles of the two genes would remain perfectly correlated. With recombination, however, the correlation between the genes will be reduced over several generations. Over a suitably long time interval, recombination will completely remove the linkage between two polymorphisms, at which point they are said to be in linkage equilibrium. When, on the other hand, the polymorphisms are correlated, we have linkage disequilibrium (LD). The amount of disequilibrium is the difference between the observed haplotype frequencies and those predicted in equilibrium.

Linkage disequilibrium can be used to measure the difference between observed and expected assortments. If there are two loci (A and B), each with two alleles (1 and 2), we can calculate haplotype probabilities and find the expected allele frequencies.

• Haplotype frequencies

– P(A1B1) = x11

– P(A1B2) = x12

– P(A2B1) = x21

– P(A2B2) = x22

• Allele frequencies

– P11 = P(A1) = x11 + x12

– P21 = P(A2) = x21 + x22

– P12 = P(B1) = x11 + x21

– P22 = P(B2) = x12 + x22

• D = x11 ∗ x22 − x12 ∗ x21 = x11 − P11 ∗ P12

Dmax, the maximum value of D with the given allele frequencies, is related to D by the following equation:

$$D' = \frac{D}{D_{max}}$$

D′ measures the disequilibrium relative to the maximum linkage disequilibrium, or complete skew, possible for the given alleles and allele frequencies. For positive D, Dmax can be found by taking the smaller of the expected haplotype frequencies P(A1, B2) or P(A2, B1). If the two loci are in complete equilibrium, then D′ = 0. If D′ = 1, there is full linkage.
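The calculation of D and D′ from the four haplotype frequencies can be sketched as follows. The sketch uses the standard definition D = x11·x22 − x12·x21. It also handles negative D by taking Dmax as the smaller of P(A1)P(B1) and P(A2)P(B2), which is an addition beyond what the notes state. The function name is hypothetical.

```python
def linkage_disequilibrium(x11, x12, x21, x22):
    """Return (D, D') from haplotype frequencies A1B1, A1B2, A2B1, A2B2."""
    # Allele frequencies are the marginals of the haplotype table.
    p_a1, p_a2 = x11 + x12, x21 + x22
    p_b1, p_b2 = x11 + x21, x12 + x22
    d = x11 * x22 - x12 * x21
    if d >= 0:
        d_max = min(p_a1 * p_b2, p_a2 * p_b1)
    else:
        d_max = min(p_a1 * p_b1, p_a2 * p_b2)
    # If a locus is monomorphic, d_max is 0 and D' is reported as 0.
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, d_prime


if __name__ == "__main__":
    print(linkage_disequilibrium(0.5, 0.0, 0.0, 0.5))      # only A1B1 and A2B2 occur
    print(linkage_disequilibrium(0.25, 0.25, 0.25, 0.25))  # independent loci
    print(linkage_disequilibrium(0.4, 0.1, 0.1, 0.4))      # partial association
```

When only two of the four haplotypes occur, D′ is 1; when the haplotype frequencies are exactly the products of the allele frequencies, D and D′ are both 0.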

The key point is that relatively recent mutations have not had time to be broken down by crossing-overs. Normally, such a mutation will not be very common. However, if it is under positive selection, the mutation will be much more prevalent in the population than expected. Therefore, by carefully combining a measure of LD and derived allele frequency, we can determine if a region is under positive selection.

Decay of linkage disequilibrium is driven by recombination rate and time (in generations) and is exponential. With a higher recombination rate, linkage disequilibrium decays faster. However, the background recombination rate is difficult to estimate and varies depending on the location in the genome. Comparison of genomic data across multiple species can help in determining these background rates.
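The exponential decay can be made concrete with the usual textbook recursion, in which D shrinks by a factor of (1 − c) each generation for a recombination rate c between the two loci. The notes do not give this formula, so treat it as an assumption of the sketch below; the function names and example rates are illustrative.

```python
import math


def ld_after(d0, recomb_rate, generations):
    """D after t generations, assuming D_t = D_0 * (1 - c)**t."""
    return d0 * (1.0 - recomb_rate) ** generations


def generations_to_halve(recomb_rate):
    """Number of generations for D to fall to half its starting value."""
    return math.log(0.5) / math.log(1.0 - recomb_rate)


if __name__ == "__main__":
    for c in (0.5, 0.01, 0.0001):
        print("c =", c, "half-life (generations):", generations_to_halve(c))
    print(ld_after(0.25, 0.01, 100))
```

Unlinked loci (c = 0.5) lose half their disequilibrium every generation, whereas tightly linked loci retain it for thousands of generations, which is why long haplotypes carry information about recent events.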

3.1 Correlation Coefficient r2

The correlation coefficient answers how predictive an allele at locus A is of an allele at locus B:

$$r^2 = \frac{D^2}{P(A_1)\,P(A_2)\,P(B_1)\,P(B_2)}$$

As the value of $r^2$ approaches 1, the alleles at the two loci become more strongly correlated. D′ can indicate linkage disequilibrium between two loci even when the alleles are only weakly correlated, i.e. when $r^2$ is low. The correlation coefficient is particularly interesting when studying associations of diseases with genes, where knowing the genotype at locus A may not predict a disease whereas locus B does. There is also the possibility that neither locus A nor locus B is predictive of the disease alone but loci A and B together are predictive.
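A short sketch of the r2 calculation from haplotype frequencies follows. It assumes the standard definition D = x11·x22 − x12·x21 together with the formula above; the function name and the example frequencies are made up for illustration.

```python
def r_squared(x11, x12, x21, x22):
    """r^2 between two biallelic loci from haplotype frequencies
    A1B1, A1B2, A2B1, A2B2."""
    p_a1, p_a2 = x11 + x12, x21 + x22
    p_b1, p_b2 = x11 + x21, x12 + x22
    d = x11 * x22 - x12 * x21
    denom = p_a1 * p_a2 * p_b1 * p_b2
    # A monomorphic locus makes the denominator 0; report 0 in that case.
    return d * d / denom if denom > 0 else 0.0


if __name__ == "__main__":
    # Alleles perfectly predict each other.
    print(r_squared(0.5, 0.0, 0.0, 0.5))
    # A rare allele A1 that only ever occurs with B1: three of the four
    # haplotypes are present, yet A and B are only weakly correlated.
    print(r_squared(0.1, 0.0, 0.4, 0.5))
```

In the second example one haplotype is entirely absent, so the loci are in strong disequilibrium in the D′ sense, but knowing the allele at one locus says little about the other, and r2 is only about 0.11.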

4 Natural Selection

In the mid-1800s the concept of evolution was not an uncommon idea, but it was not until Darwin and Wallace proposed natural selection as the mechanism that drives evolution in nature that the theory of evolution gained widespread recognition. It took until 1948 for J.B.S. Haldane's malaria hypothesis to provide the first example of natural selection in humans. He showed a correlation between genetic mutations in red blood cells and the distribution of malaria prevalence, and discovered that the specific mutation that made individuals suffer from sickle-cell anaemia also made them resistant to malaria.

Lactose tolerance (lasting into adulthood) is another example of natural selection. Such explicit examples were hard to prove without genome sequences. With whole genome sequencing readily available, we can now search the genome for regions with the same patterns as these known examples to identify further regions undergoing natural selection.

4.1 Genomics Signals of Natural Selection

• Ka/Ks ratio of non-synonymous to synonymous changes per gene

• Low diversity and many rare alleles over a region (e.g. Tajima's D with regard to sickle-cell anemia)

• High (or low) derived allele frequency over a region (e.g. Fay and Wu's H)

• Differentiation between populations faster than expected from drift (Measured with Fst)

• Long haplotypes: evidence of selective sweep.

• Exponential prevalence of a feature in sequential generations

• Mutations that help a species prosper

4.1.1 Examples of Negative (Purifying) Selection

• Across species we see negative selection of new mutations in conserved functional elements (exons, etc.).

• New alleles within one species tend to have lower allele frequencies if the allele is non-synonymous than if it is synonymous. Lethal alleles have very low frequencies.


Figure 5: Approximate Time Table of Effects (Sabeti et al., Science 2006)

4.1.2 Examples of Positive (Adaptive) Selection

• Similar to negative selection, in that positive selection is more likely in functional elements or non-synonymous alleles.

• Across species in a conserved element, a site might be the same over most mammals but changed in a specific species because a positively selected mutation appeared after speciation or caused speciation.

• Within a species, positively selected alleles likely differ in allele frequency (Fst) across populations. Examples include malaria resistance in African populations (Figure 6) and lactase persistence in European populations (Figure 7).

• Polygenic selection within species can arise when a trait is selected for that depends on many genes. An example is human height, where 139 SNPs are known to be related to height. Most are not population-specific mutations but alleles found across all humans that are selected for in some populations more than others (Figure 8).

4.1.3 Statistical Tests

• Long range correlations (iHS, Xp, EHH): If we tag genetic sequences in an individual based on their ancestry, we end up with a broken haplotype, where the number of breaks (color changes) is correlated with the number of recombinations and can tell us how long ago a particular ancestry was introduced.

• SWEEP: A program developed by Pardis Sabeti, Ben Fry and Patrick Varilly. SWEEP detects evidence of natural selection by analyzing haplotype structures in the genome using the long range haplotype test (LRH). It looks for high-frequency alleles with long-range linkage disequilibrium, which hints at large-scale proliferation of a haplotype at a rate greater than recombination could break it from its markers.


Figure 6: Localized positive selection for malaria resistance within species (Sabeti et al., Science 2006)

Figure 7: Localized positive selection for lactase persistence allele (Sabeti et al., Science 2006)

• High Frequency Derived Alleles: Look for large spikes in the frequency of derived alleles in set positions.

• High Differentiation (Fst): Large spikes in differentiation at certain positions.
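To illustrate the long-range haplotype idea behind the first two tests, the sketch below computes extended haplotype homozygosity (EHH) as it is commonly defined: the probability that two randomly chosen haplotypes carrying the same core allele are identical at every marker from the core out to a given distance. The notes do not give this formula, and this is not SWEEP's implementation; the function name and the toy haplotypes are made up.

```python
from collections import Counter


def ehh(haplotypes, core_index, distance):
    """Extended haplotype homozygosity from a core site out to core_index + distance.

    haplotypes: equal-length allele strings that all carry the same core allele.
    distance:   number of markers to extend (negative extends to the left).
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    lo, hi = sorted((core_index, core_index + distance))
    lo = max(0, lo)
    # Group haplotypes that are identical over the whole interval.
    groups = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    identical_pairs = sum(c * (c - 1) // 2 for c in groups.values())
    return identical_pairs / (n * (n - 1) // 2)


if __name__ == "__main__":
    # Toy haplotypes sharing allele '1' at the core site (index 2).
    haps = ["0010", "0011", "0110", "0010"]
    for dist in (0, 1, -1, -2):
        print(dist, ehh(haps, core_index=2, distance=dist))
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break haplotypes apart; an allele that is both common and still surrounded by high EHH is a candidate for recent positive selection.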

Using these tests, we can find genomic regions under selective pressure. One problem is that a single SNP under positive selection will allow nearby SNPs to piggy-back and ride along. It is difficult to distinguish the SNP under selection from its neighbours with only one test. Under selection, all the tests are strongly correlated; however, in the absence of selection they are generally independent. Therefore, by employing a composite statistic built from all of these tests, it is possible to isolate the individual SNP under selection.

Examples where a single SNP has been implicated in a trait:


Figure 8: Mean allele frequency difference of height SNPs, matched SNPs, and genome-wide SNPs between Northern- and Southern-European populations (Turchin et al., Nature Genetics 2012)

Figure 9: Broken haplotype as a signal of natural selection

• Chr15 Skin pigmentation in Northern Europe

• Chr2 Hair traits in Asia

• Chr10 Unknown trait in Asia

• Chr12 Unknown Trait in Africa

5 Human Evolution

5.1 A History of the Study of Population Dynamics

Not surprisingly, the scientific community has a long and somewhat controversial history of interest in recent population dynamics. While some of this interest was indeed applied toward more nefarious aims, such as scientific justifications for racism or eugenics, these are increasingly the exception and not the rule. Early studies of population dynamics were primitive in many ways. Quantifying the differences between human populations was originally performed using blood types, as they seemed to be phenotypically neutral, could be tested for outside of the body, and seemed to be polymorphic in many different human populations. Fast forward to the present, and the scientific community has realized that there are other glycoproteins beyond the A, B and O blood groups that are far more polymorphic in the population. As science continued to advance and sequencing became a reality, researchers began sequencing the Y chromosome and mitochondrial DNA, along with microsatellite markers around them. What's special about those two types of genetic data? First and foremost, they are quite short, so they can be sequenced more easily than other chromosomes. Beyond just the size, the Y and mitochondrial chromosomes were of such interest because they do not recombine and can be used to easily reconstruct inheritance trees. This is precisely what makes these chromosomes special relative to a short chunk of an autosome: we know exactly where it comes from because we can trace paternal or maternal lineage backward in time.

This type of reconstruction does not work with other chromosomes. If one were to generate a tree using a certain chunk of chromosome 1 in a certain population, for instance, the sequences would indeed form a phylogeny, but that phylogeny would be picked from random ancestors in each of the family trees.

As sequencing continued to develop and grow more effective, the Human Genome Project was proposed, and along with it came a strong push to include some sort of diversity measure in genomic data. Technically speaking, it was easiest to look at microsatellites for this diversity measure, because size polymorphisms can be studied on a gel without inspecting sequence polymorphisms. As a reminder, a microsatellite is a region of variable length in the human genome, often characterised by short tandem repeats. A related source of variation is retrotransposons inserting themselves into the genome, such as the Alu elements in the human genome. These elements sometimes become active and retro-transpose as insertion events, and one can trace when those insertion events happened in the human lineage. Hence there was a push early on to assay these parts of the genome in a variety of different populations. The really attractive thing about microsatellites is that they are highly polymorphic and one can actually infer their rate of mutation. Hence we can not only say that there is a certain relationship between populations based on these rates, but we can also say how long they have been evolving, when certain mutations occurred, and how long they have been on certain branches of the phylogenetic tree.

FAQ

Q: Can’t this simply be done with SNPs?

A: You can’t do it very easily with SNPs.

You can get an idea of how old they are based on their allele frequency, but they’re also going to be influenced by selection.

After the Human Genome Project came the HapMap project, which looked at SNPs genome-wide. We discussed haplotype inheritance in detail in prior chapters, where we learned the importance of HapMap in designing genotyping arrays that look at SNPs marking common haplotypes in the population.

The effects of bottlenecks on human diversity. Using this wealth of data across studies, together with a plethora of mathematical techniques, has led to the realization that humans in fact have very low diversity given our census population size, which implies a small effective population size. Using the Wright-Fisher model, it is possible to work back from the level of diversity and the number of mutations we see in the population today to a founding population size. When this computation is performed, it works out to around 10,000.
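To make the back-calculation concrete, here is a minimal sketch assuming the standard neutral expectation for a diploid Wright-Fisher population, in which nucleotide diversity is approximately 4 N_e mu. The two input values are illustrative round numbers chosen for this example and are not taken from the lecture.

```python
def effective_population_size(nucleotide_diversity, mutation_rate):
    """Solve pi = 4 * N_e * mu for N_e (neutral, diploid Wright-Fisher equilibrium)."""
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive")
    return nucleotide_diversity / (4.0 * mutation_rate)


# Illustrative assumptions, not values from the lecture:
ASSUMED_PI = 1.0e-3   # pairwise differences per base pair
ASSUMED_MU = 2.5e-8   # mutations per base pair per generation

ne = effective_population_size(ASSUMED_PI, ASSUMED_MU)
print(round(ne))
```

With these assumed inputs the estimate lands on the same order of magnitude as the figure quoted above. The result scales inversely with the assumed mutation rate, so a different rate shifts the estimate proportionally.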


FAQ

Q: Why is this so much smaller than our census population size?

A: There was a population bottleneck somewhere.

Figure 10: A depiction of two major bottleneck events, one in the founding population from Africa, and other, smaller subsequent bottleneck events in the East Asian and Western European populations.

Most of the total variation between humans occurs within continents. One can measure how much diversity is explained by geography and how much is not, and it turns out that most of it is not explained by geography. In fact, most common variants are polymorphic in every population, and if a common variant is unique to a given population, there probably hasn’t been enough time for that to happen by drift alone. Recall what an unlikely process it is to reach a high allele frequency over the course of several generations by mere chance. Hence we may interpret this as a signal of selection when it occurs. All of the evidence from comparing diversity patterns and tracing trees back to ancestral haplotypes converges on the Out-of-Africa hypothesis, which is the overwhelming consensus in the field and is the lens through which we view all population genetic data. Starting from the African founder population, studies have demonstrated that it is possible to model population growth using the Wright-Fisher model. These studies have shown that the growth rates we see in Asian and European populations are only consistent with large exponential growth after the out-of-Africa event.
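The claim that drift alone rarely carries a new allele to high frequency in a few generations can be checked with a small Wright-Fisher simulation. This is a sketch under simple assumptions (constant population size, no selection, no mutation, binomial resampling of allele copies each generation); the function names and parameter values are hypothetical.

```python
import random


def next_generation_count(count, n_copies, rng):
    """Resample n_copies allele copies, each derived with probability count / n_copies."""
    p = count / n_copies
    return sum(1 for _ in range(n_copies) if rng.random() < p)


def fraction_reaching(threshold, start_count, n_copies, generations, trials, seed=0):
    """Fraction of neutral trajectories whose frequency reaches `threshold` in time."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        count = start_count
        for _ in range(generations):
            count = next_generation_count(count, n_copies, rng)
            if count == 0:
                break  # allele lost; it cannot return without new mutation
            if count / n_copies >= threshold:
                hits += 1
                break
    return hits / trials


# One new copy among 200 chromosomes (100 diploid individuals), 20 generations.
print(fraction_reaching(0.5, 1, 200, 20, 300))
```

Most trajectories are lost within the first few generations, which is why a population-specific common variant is treated in the text as a possible signal of selection.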

This helps us understand the reasons for phenotypic differences between populations, as bottlenecks that are followed by exponential growth can lead to an excess of rare alleles. The present theory of human diversity states that there were secondary bottleneck events after the founding population migrated out of Africa. These founders were, at some earlier point, subject to an even smaller bottleneck event, which is now reflected in every human genome on the planet regardless of immediate ancestry. It is possible to estimate how small the original bottleneck was by looking at differences between individuals of African and European origin, inferring the effects of the secondary bottleneck and the period of exponential growth of the European population. The other way of approaching bottleneck estimation is to inspect the allele frequency spectrum needed to build coalescent trees. In this way, one can take haplotypes across the genome and ask what the most recent common ancestor was by observing how the coalescence varies across the genome. For instance, one may guess that some haplotype was positively selected only recently, given the length of the haplotype. An example of one such recent mutation in the European population is the lactase gene. Another example, for the Asian population, is the ER locus.

Table 2: Genetic Estimates of Recent Population Growth in Europe

There is a wealth of literature showing that when one draws a coalescence tree for most haplotypes, it ends up going back well before the time we think speciation happened. This indicates that certain features have been kept polymorphic for a very long time. One can, however, look at the distribution of these features across the whole genome and infer something about population history from it. If there was a recent bottleneck in a population, it will be reflected in many common ancestors being very recent, whereas more ancient lineages will have survived the bottleneck. One can take the distribution of coalescence times and run simulations of how the effective population size would have varied with time. The model for this type of study was outlined by Li and Durbin. Figure 11, from their study, illustrates two such bottleneck events. The first is the bottleneck that occurred in Africa long before migrations out of the continent. This was followed by population-specific bottlenecks that resulted from groups migrating out of Africa. This is reflected in the diversity of populations today based on their ancestry, and it can be derived from looking at a pair of chromosomes from any two people in these populations.
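The link between population size history and coalescence times can be illustrated with a toy simulation. This is not the Li and Durbin method; it is a sketch assuming a Wright-Fisher population in which two lineages coalesce in any given generation with probability 1/(2N), with N changing in piecewise-constant epochs going backward in time. All names and sizes are hypothetical.

```python
import math
import random


def sample_pairwise_tmrca(epochs, rng):
    """Draw one pairwise coalescence time, in generations before the present.

    `epochs` is a list of (duration, diploid_size) pairs ordered from the
    present backward; the last duration must be None (it lasts forever).
    """
    elapsed = 0
    for duration, n in epochs:
        p = 1.0 / (2 * n)                  # per-generation coalescence probability
        u = 1.0 - rng.random()             # uniform on (0, 1]
        wait = int(math.log(u) / math.log(1.0 - p)) + 1   # geometric waiting time
        if duration is None or wait <= duration:
            return elapsed + wait
        elapsed += duration                # no coalescence in this epoch; go further back
    raise ValueError("the last epoch must have duration None")


def fraction_recent(epochs, cutoff, samples, seed=0):
    """Fraction of sampled coalescence times that fall within `cutoff` generations."""
    rng = random.Random(seed)
    times = [sample_pairwise_tmrca(epochs, rng) for _ in range(samples)]
    return sum(1 for t in times if t <= cutoff) / samples


constant = [(None, 100)]                   # N = 100 throughout
bottleneck = [(50, 10), (None, 100)]       # N = 10 for the most recent 50 generations
print(fraction_recent(constant, 50, 5000), fraction_recent(bottleneck, 50, 5000))
```

In the bottleneck scenario most pairs coalesce inside the bottleneck, while the few that escape have much older ancestors. That is the pattern described above: a recent bottleneck concentrates common ancestors at recent times.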

Figure 11: An illustration of two bottleneck events


5.2 Understanding Disease

Understanding that human populations went through bottlenecks has important implications for understanding population-specific disease. A study published by Tennessen et al. in 2012 looked at exome sequences in many classes of individuals. The study set out to examine how rare variants might contribute to disease; as a consequence, the authors were able to fit population genetic models to the data and ask what sort of deleterious variants were seen when sequencing exomes from a broad population panel. Using this approach, they were then able to estimate parameters describing how long ago exponential growth occurred in the founder and branching populations. See Figure 12 below for an illustration of this:

Figure 12: The effect of bottleneck events on the number of rare alleles in a population.

5.3 Understanding Recent Population Admixture

In addition to viewing coalescence times, one can also perform Principal Component Analysis on SNPs to gain an understanding of more recent population admixture. Running this on most populations shows clustering with respect to geographical location. There are some populations, however, that experienced recent admixture for historical reasons. The two most commonly referred to in the scientific literature are: African Americans, who on average are 20

Figure 13: A depiction of European admixture levels in the Mexican and African American populations.


There are two major things one can say about the admixture events of African Americans and Mexican Americans. The first and more obvious is inferring the admixture level. The second, and more interesting, is inferring when the admixture event happened based on the actual mixture level. As we have discussed in previous chapters, the ancestry signatures in the genome break down after admixture because of recombination in each generation. If the population is contained, the proportions of European and West African ancestry should stay the same in each generation, but the segments will get shorter due to the mixing. Hence the length of the haplotype blocks can be used to date when the mixing originally happened. (When it originally happened we would expect large chunks, with some gametes being entirely of African origin, for instance.) Using this approach, one can look at the distribution of recent ancestry tracts and then fit a model of when these migrants entered an ancestral population, as shown below:

Figure 14: An illustration of the magnitude and origin of migrants based on the tract length and number of tracts in the admixed population.
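The dating argument from tract lengths can be sketched numerically. The model here is a deliberate simplification and an assumption of this example, not the fitted model from the lecture: a single pulse of admixture T generations ago, with ancestry tract lengths treated as exponentially distributed with mean 1/T Morgans, so that T is estimated as the reciprocal of the mean tract length. The function names are hypothetical and the tracts are simulated, not real data.

```python
import random


def estimate_generations_since_admixture(tract_lengths_morgans):
    """Estimate T as 1 / (mean tract length in Morgans) under the simplified model."""
    if not tract_lengths_morgans:
        raise ValueError("need at least one tract length")
    mean_length = sum(tract_lengths_morgans) / len(tract_lengths_morgans)
    return 1.0 / mean_length


def simulate_tract_lengths(generations, n_tracts, seed=0):
    """Simulated tracts: exponential with rate `generations` per Morgan (assumption)."""
    rng = random.Random(seed)
    return [rng.expovariate(generations) for _ in range(n_tracts)]


tracts = simulate_tract_lengths(generations=10, n_tracts=5000)
print(round(estimate_generations_since_admixture(tracts), 1))
```

Shorter tracts push the estimate toward older admixture, matching the intuition that segments shrink with each generation of recombination. Real analyses also account for the admixture proportion, continuous migration, and tracts cut off at chromosome ends, all of which this sketch ignores.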

6 Current Research

6.1 HapMap project

The International HapMap Project aims to catalog the genomes of humans from various countries and regions and to find similarities and differences, helping researchers find genes that will benefit advances in disease treatment and the administration of health-related technologies.

6.2 1000 genomes project

The 1000 Genomes Project is an international consortium of researchers aiming to establish a detailed catalogue of human genetic variation. Its aim was to sequence the genomes of more than a thousand anonymous participants from a number of different ethnic groups. In October 2012, the sequencing of 1092 genomes was announced in a Nature paper. It is hoped that the data collected by this project will help scientists gain more insight into human evolution, natural selection and rare disease-causing variants.

7 Further Reading

• Campbell Biology, 9th edition; Pearson; Chapter 23: The Evolution of Populations

• The Cell, 5th edition, Garland Publishing; Chapter 5: DNA replication, repair and recombination; Chapter 20: Germ cells and fertilization

