Mod01 Resources notes

Date post: 10-Dec-2023
Course Notes for the edX course

Useful Genetics – Part 1

by Professor Rosemary Redfield

Notes by Katrien De Cock

Contents

1 How different are we?

Lecture 1.A How different are we?
    1.A.1 Differences
    1.A.2 Phenotype and genotype
    1.A.3 DNA
    1.A.4 Life Cycle
    1.A.5 Differences between genomes

Lecture 1.B The properties of DNA
    1.B.1 DNA as a physical molecule
    1.B.2 DNA as a genetic information carrier

Lecture 1.C How we represent DNA
    1.C.1 Representation vs reality in genetics
    1.C.2 DNA is coiled and folded

Lecture 1.D History of DNA
    1.D.1 DNA replication
    1.D.2 The Cell Theory in biology
    1.D.3 DNA’s evolutionary continuity

Lecture 1.E What makes some DNA sequences genes?
    1.E.1 RNA and protein
    1.E.2 Genes are information in DNA
    1.E.3 DNA information becomes protein information by transcription and translation

Lecture 1.F Coding for proteins
    1.F.1 mRNA must be decoded in protein synthesis
    1.F.2 The ‘Genetic Code’ is the codebook
    1.F.3 Transfer RNAs translate the code
    1.F.4 Reading frames

Lecture 1.G More about genes
    1.G.1 Introns and splicing
    1.G.2 How cells and geneticists identify genes

Lecture 1.H What makes these processes so confusing?
    1.H.1 Common features of DNA replication, transcription and translation and their different functions
    1.H.2 A text analogy
    1.H.3 Why is it so hard to keep them straight?

Lecture 1.I What is a chromosome?
    1.I.1 One very long DNA molecule
    1.I.2 Information for 100s or 1000s of genes
    1.I.3 Regulatory signals
    1.I.4 Human chromosomes
    1.I.5 Two terms for genes
    1.I.6 How we can represent chromosomes

Lecture 1.J Genes on Chromosomes
    1.J.1 Dive into a chromosome to resolve a single gene
    1.J.2 How genes are arranged

Lecture 1.K DNA Sequencing
    1.K.1 Modern DNA Sequencing
    1.K.2 Applications

Lecture 1.L Homology in biology and in genes
    1.L.1 Homology
    1.L.2 How to decide if similarities are due to homology
    1.L.3 Homologous chromosomes

Lecture 1.M Life Cycles
    1.M.1 The cell cycle
    1.M.2 Typical plant/animal life cycles

Lecture 1.N Ploidy and recombination
    1.N.1 Ploidy: sexual reproduction alternates haploid and diploid cells
    1.N.2 Recombination creates new versions of genomes, by reassortment and crossing over

Lecture 1.O Genetic variation in populations
    1.O.1 Alleles in individuals and populations
    1.O.2 Kinds of DNA sequence differences
    1.O.3 Comparing DNA sequences
    1.O.4 Allele frequencies in populations
    1.O.5 Haplotypes

Lecture 1.P Genetic and evolutionary relationships of human populations
    1.P.1 Human similarity
    1.P.2 Similarities of other species
    1.P.3 Human origins: Out of Africa
    1.P.4 Did humans really mate with Neanderthals?

Index

Module 1

How different are we?

Lecture 1.A How different are we?

Outline

In this very first lecture of the first module of Useful Genetics we will start in Section 1.A.1 by thinking about human differences. We are going to introduce some very basic concepts, such as phenotype and genotype (Section 1.A.2), and give a first introduction to DNA (Section 1.A.3). In Section 1.A.4 the human life cycle is briefly described, and we end in Section 1.A.5 by looking at genetic differences.

1.A.1 Differences

As people we’re really good at noticing differences in the appearance of other people. That’s probably because we’re social animals. We’re also really good at noticing the ways in which people from any particular part of the world look different from us, but very much like each other. It’s often said that we’re all the same under the skin. But are we really? This is a question that genetics can answer.

1.A.2 Phenotype and genotype

To think about how different we really are, we need to introduce two very important terms that will come up again and again: phenotype and genotype.

These terms were invented by the very first geneticists as a way of distinguishing between two things. At first, the phenotype was the observable properties of organisms: what you could see, the features that we cared about. It has come to include not just what you can see with your eyes, but any observable property of an organism: molecular properties, behavioural properties, anything.

Figure 1.1: Phenotype and genotype

The term genotype was invented when the brave inference was made that the phenotypic differences we see were caused by differences in some mysterious entities, which were given the name genes. Different versions of genes were the different genotypes. We now know that this inference was correct, and that the different versions of genes are in fact different versions of the DNA sequences of chromosomes.

An overview of the original and modern meanings of phenotype and genotype is given in Figure 1.1.

1.A.3 DNA

In this section, a quick refresher about DNA is given.

All of our genetic material is organized into long molecules called chromosomes. These long molecules are chains of subunits called bases, represented in Figure 1.2. DNA actually consists of two intertwined chains (a double helix), but for simplicity we will consider one chain for now.

Figure 1.2: Each chromosome is a very long molecule, a chain of subunits called bases.

In DNA and other informational molecules, the bases aren’t all the same. Instead, they come in four slightly different versions (see Figure 1.3), called A, G, T, and C, which are just the single-letter abbreviations of their chemical names. We often represent DNA as the order of the bases in the sequence because this order is in fact the genetic information. Differences in the order of the bases cause the phenotypic differences that we see.

Figure 1.3: There are four slightly different kinds of subunits: A, G, T and C. The precise order of the bases is very important because it is the genetic information.

1.A.4 Life Cycle

We are now going to think about DNA in the life cycle. Each egg and each sperm contains one complete set of DNA molecules, 23 chromosomes. When the sperm fuses with the egg, the resulting fertilized egg contains two complete sets of genes, one from each parent, see Figure 1.4.

Individuals develop because the fertilized egg undergoes many, many cycles of growth and cell division, producing all of the cells in our body. And, what’s very important from a genetic perspective, all of the cells in our body contain the same two sets of genes that were in the original egg and sperm.

What is this complete set of DNA molecules?
One complete set consists of 23 long chains called chromosomes.

How many subunits are we talking about?
We’re talking about three billion subunits, $3 \cdot 10^9$.

A complete set of DNA molecules is called a genome.

When we speak of the human genome, we mean a reference sequence. But when we think about humans, we need to think about the genomes of all humans, and the genetic diversity between us. Together this makes up the real genetic endowment of all humans.

Figure 1.4: The fertilized egg contains two complete sets of chromosomes, one from each parent.

Figure 1.5: 1,000 bases of the set you inherited from mum (red) and 1,000 bases from dad (blue).

Figure 1.6: Overlaying 1,000 of mum’s and dad’s bases, we see there is only 1 difference.

1.A.5 Differences between genomes

How different, then, are our DNA sequences? From one perspective, they are very, very similar. On average they’re 99.9% identical, so 1 base in 1,000 is different.

A way to visualize that is to consider the set of DNA that you inherited from your mum, and the set that you inherited from your dad, shown in red and blue, respectively, in Figure 1.5. Here I’m just showing 1,000 bases. If you compare them, what you’ll find is that on average, in every 1,000 bases, there’s just one sequence difference, see Figure 1.6.

What if we were to compare a set of DNA from you and from one of your neighbours? Again, we pick 1,000 bases and compare you and your neighbour. On average, there is 1 DNA sequence difference. Even if we were to compare two sets from people from opposite sides of the world, again we’ll see that on average there’s just 1 difference. So it doesn’t really make any difference how far apart we come from. We’re all very similar.
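The comparison described above amounts to counting the positions at which two aligned sequences differ. Here is a minimal Python sketch, assuming the two sequences are already aligned and equally long. The function name `count_differences` and the two 20-base sequences are made up for illustration; they are not real human data.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Hypothetical 20-base stretches (illustration only, not real sequence data).
from_mum = "ATGCCGTAAGCTTAGGCTAA"
from_dad = "ATGCCGTAAGCTCAGGCTAA"

differences = count_differences(from_mum, from_dad)
identity = 1 - differences / len(from_mum)
print(differences, identity)
```

With real genomes the same count would be made over stretches of 1,000 bases or more, where the notes give an average of about one difference.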

But there are two perspectives. The other perspective is to think about how different we are. How many genetic differences are there between you and your neighbour? The answer is: there are six million differences. That is a lot of differences between you and your neighbour, and between each of us and every other one of us. Where did this number come from? Here’s the calculation: $\frac{1}{1000} \cdot 2 \cdot 3 \cdot 10^9 = 6 \cdot 10^6$. There’s 1 difference per 1,000 bases. There are three billion bases per set and there are two sets per person. So that gives us six million differences.
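The same arithmetic can be checked in a few lines of Python. This is only a sketch of the calculation in the text; the variable names are chosen here for readability, and integer arithmetic is used to avoid rounding.

```python
BASES_PER_DIFFERENCE = 1_000   # 1 difference per 1,000 bases
BASES_PER_SET = 3 * 10**9      # three billion bases in one set
SETS_PER_PERSON = 2            # one set from each parent

total_bases = BASES_PER_SET * SETS_PER_PERSON
expected_differences = total_bases // BASES_PER_DIFFERENCE
print(expected_differences)
```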

How much phenotypic difference do these genotypic differences make? Most of these differences don’t make any detectable difference to the phenotype. That’s because most of these genetic differences are in parts of the genome that are really not important for determining our phenotypic properties at all. But the rest of the genetic differences cause all of the heritable differences between people.

Just in case you’re thinking that people are kind of special, we’re not. Exactly the same situation exists for all of the other organisms on the planet. The sequences within any one species are very similar, but there are so many bases that there are still a lot of differences, which cause all of the phenotypic differences.

Lecture 1.B The properties of DNA

Outline

We’re going to introduce the properties of DNA, thinking both about DNA as a physical molecule with physical properties (Section 1.B.1) and as an informational molecule (Section 1.B.2). We will find out how these physical properties allow DNA to be the carrier of genetic information. In Figure 1.7 you can see an overview of the physical and informational properties of DNA.

Figure 1.7: Physical and informational properties of DNA


1.B.1 DNA as a physical molecule

DNA is the kind of molecule that’s called a polymer. As we said earlier, it’s a chain of identical or nearly identical subunits. In DNA’s case, as an informational polymer, the subunits need to be different because it’s the order of the subunits that creates the information. In DNA, there are four kinds of subunits, called bases. They go by the names A, G, C and T.

DNA typically has two strands. We talk about it as if it were one molecule, but really the functional DNA in the cell is two molecules wound around each other. We often draw DNA as in Figure 1.8 to indicate that.

Figure 1.8: DNA consists of two molecules wound around each other.

In these two strands, the bases that are across from each other are physically complementary. They’re not identical, but they fit together. What that means is shown in Figure 1.9. It is just a schematic representation of the four bases as four different shapes attached to a line that represents the DNA strand. The important feature of these four shapes is that they are pairwise complementary. The shape of the A base fits with the shape of the T base, and the shape of the G base fits with the shape of the C base. But they don’t fit in other combinations. If we try to pair an A with a C, or a C with a T, either they don’t fit or they bump together as in Figure 1.10.

This fit between A and T and between C and G is mediated both by complementary shapes and by complementary charges, the mixtures of plus and minus charges on different parts of the base molecules. These interactions of shapes and charges together create a kind of chemical interaction called hydrogen bonding. Hydrogen bonds are individually weak but collectively strong chemical bonds that serve several very important physical functions for DNA:

• they hold the DNA strands together so they don’t come apart,

• they direct DNA replication so that a new strand can be synthesized using the old strand,

• and they direct the synthesis of the related molecule called RNA.

Figure 1.9: A schematic representation of two DNA strands. The four bases are four different shapes that pairwise fit together. The A base shape fits with the T base shape and the C base shape fits with the G base shape.

Figure 1.10: The A, C, G and T bases do not fit in combinations other than A-T and C-G.
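The pairing rule described above (A fits with T, G fits with C, nothing else fits) can be written as a small lookup table. This is a minimal Python sketch; the names `PAIRS` and `fits` are chosen here for illustration.

```python
# Each base and the one base whose shape and charges complement it.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def fits(base_1: str, base_2: str) -> bool:
    """Return True if the two bases form a complementary pair."""
    return PAIRS.get(base_1) == base_2


print(fits("A", "T"))  # complementary pair
print(fits("A", "C"))  # not a complementary pair
```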

Figure 1.11 is a schematic drawing of DNA. The DNA in each chromosome is two very long single strands, and you can follow these in the chemical structure along the backbone of each strand.

The bonds within each strand, which link one subunit to the next, are strong bonds. They’re not going to come apart at all. In contrast, the two strands are held together by hydrogen bonds between the bases, which are weak bonds. They’re fairly easily pulled apart. Individually, they’re weak. Summed up over all the bonds in the whole molecule, they hold the DNA together very strongly. But it’s relatively easy to pry individual bonds apart. This is important because the strands have to separate when the DNA replicates, or when the bases are transcribed into RNA to direct protein synthesis.

Figure 1.11: In this drawing of the DNA molecule, we see the paired bases Adenine, Thymine, Cytosine and Guanine, the weak hydrogen bonds between the two strands and the strong bonds between the bases in the same strand. The two strands have opposite directions, indicated by the 3’ and 5’ ends.

One other point is that these molecules are asymmetric. They’re like text. The letters in our alphabet are asymmetric. If you write them backwards, they mean something different. And the text that we read has to be read in a particular direction to make sense. In Western languages, we read from left to right. In the same way, the DNA strands are asymmetric. We can think of each strand as having a direction and we refer to the ends by numbers: three prime (3’) and five prime (5’).

One final feature of the double helix: even though the two DNA strands are wound around each other, and the bases interact with each other in base pairs, the sides of the bases are still exposed on the side of the double helix, like the sides of a rope. Therefore a protein can still, just by looking at the sides of the DNA, sense which bases are present in which order.

1.B.2 DNA as a genetic information carrier

The sequences of bases also create the genetic information in the DNA, because the order of the bases along the chain specifies protein sequences, and proteins are the workhorses of the cell. It also specifies other genetic functions.

The two strands have complementary shapes and charges, thus the two strands have complementary information as well, where the information is the sequence of the bases. Because of this, it’s possible to do error correction. If one strand is wrong, the bases in the other strand provide the information to correct the mistakes. This is called informational redundancy and it’s a very important feature of all informational systems. It also allows for copying of the information, duplicating the genetic information, for heredity. And it allows for a readout of the information to instruct the cells to produce proteins.
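The idea of error correction through redundancy can be illustrated with a toy example. This is a sketch under strong assumptions, not a model of real DNA repair: a damaged base is marked with the letter `N` (an assumption made here), the partner strand is taken to be intact, and both strands are written so that position i of one pairs with position i of the other. The function name `repair` is hypothetical.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def repair(damaged: str, partner: str) -> str:
    """Fill in each unreadable base ('N') using the base paired with it."""
    if len(damaged) != len(partner):
        raise ValueError("strands must have the same length")
    return "".join(
        PAIRS[p] if d == "N" else d
        for d, p in zip(damaged, partner)
    )


# Position 2 of the damaged strand is unreadable; the partner has C there,
# so the missing base must be G.
print(repair("ATNCG", "TACGC"))
```

In the cell, the harder part is recognizing which strand carries the mistake; the sketch sidesteps that by marking the damage explicitly.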

As we have seen in Section 1.B.1, the bases in the DNA molecule are exposed at the sides of the double helix. This is important for converting the physical properties into information.

There are two ways that the physical properties of DNA become information.

1. The base sequences of some segments of DNA are recognized by specialized proteins that feel their shapes and control other events.

2. The base sequences of other segments code for proteins. They are transcribed and translated into polymers of RNA and protein. The DNA sequence specifies the RNA and protein sequences.

What has to happen first isn’t coding for proteins. Instead, it’s the ability of specialized DNA-binding proteins to feel the shapes of the DNA and bind to particular sequences. That’s illustrated in Figure 1.12. A regulatory protein is feeling its way along the DNA until it comes to a place in the DNA that has the right sequence of bases that it’s able to bind to. This protein can feel the sequence of the bases even though the DNA is still base paired in its double helix. The big green arrow indicates that this protein, having found its appropriate sequence, is going to cause something to happen. Maybe DNA replication, maybe some other process.

Figure 1.12: A regulatory protein, represented by the cloud, feels its way along the DNA until it comes to a place in the DNA that has the right sequence of bases that it’s able to bind to. The green arrow indicates that the protein is going to cause something to happen.
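In terms of sequence information, a regulatory protein "feeling its way along the DNA" is loosely analogous to scanning a string for a particular pattern. Here is a minimal Python sketch, assuming the protein recognizes one exact sequence on the strand as written. The function name `find_binding_sites`, the recognition sequence and the DNA sequence are all made up for illustration; real proteins often tolerate some variation and can bind either strand.

```python
def find_binding_sites(dna: str, site: str) -> list[int]:
    """Return every start position (0-based) where `site` occurs in `dna`."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)  # keep scanning past this match
    return positions


dna = "GGCTATAAAGGCCTATAAAT"  # hypothetical stretch of DNA
site = "TATAAA"               # hypothetical recognition sequence
print(find_binding_sites(dna, site))
```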

The base sequences of other segments code for protein. They do that by undergoing processes called transcription and translation into polymers of RNA and protein, respectively. The DNA sequence specifies the sequence of these molecules. Figure 1.13 is a little diagram of the process of transcription. The double-helical, double-stranded DNA comes apart. It’s unzipped so that a protein, namely RNA polymerase, represented by the blue oval shape, can make an RNA copy (the blue line) of one strand. The RNA molecule is drawn as a wavy line because RNA molecules, being single-stranded, are intrinsically more flexible than DNA molecules.

Figure 1.13: The double-stranded DNA (black) is unzipped so that a protein (RNA polymerase, indicated by the blue oval shape) can make an RNA copy of one strand. The RNA polymer is represented by the blue line.

Summary

We have described DNA as a physical molecule and as an informational carrier. Its properties can be summarized as follows.

• DNA’s physical properties

1. two strands

2. stable backbone of each strand

3. base pairing between strands

4. exposed sides of bases


• DNA’s informational properties

1. sites where regulatory proteins act

2. accurate replication

3. coding for proteins

Lecture 1.C How we represent DNA

Outline

Representation is really important for DNA, because we can’t actually see DNA molecules. This means we’re very dependent on having conventions that everyone agrees on, so we know what we’re talking about when we draw something to represent DNA. We will see a number of representations in Section 1.C.1. Usually we represent DNA as a line, but in reality DNA is not stretched out. We will discuss how it is coiled and folded in Section 1.C.2.

1.C.1 Representation vs reality in genetics

Genetics is necessarily, and has always been, about things we can’t see. In classical genetics, when genes were first discovered, they were purely hypothetical. Nobody had any idea of their properties at all. We know a lot more now. We know all about the molecules. But they’re still too small for us to see.

We can make very accurate representations of the molecules. In Figure 1.14 you can see molecularly correct drawings of DNA molecules. But in fact, such representations are not very useful. They’re much too detailed, unless you’re a person who’s physically studying the structure of DNA, which we are not. We’ll find that much simpler representations are much more useful.

Figure 1.14: Molecularly correct drawings of DNA, which are too detailed to be useful for us.

Figure 1.15: A simple representation of the two strands of a DNA molecule, with their direction.

Figure 1.16: A simple representation of the two strands of a DNA molecule, with their direction and vertical lines that indicate the base pairing.

For example, often we will represent DNA and chromosomes by simply drawing a line. That’s about as simple a representation as you can get. If we care about the orientation of the DNA, with its 5 prime and 3 prime ends, we might use an arrow to mark the 3 prime end of the DNA.

By convention, when we represent a single strand of DNA, we generally represent it with the 5 prime end on the left and the 3 prime end on the right. Just as there’s a convention for text, the same sort of convention applies for DNA. If we need to think about both strands, we may draw both strands as two lines. We may include arrowheads on the lines to remind us of their anti-parallel direction, as in Figure 1.15. If we want to really remind ourselves that these lines are DNA held together by base pairing, we can draw vertical lines connecting the strands to indicate the base pairs, see Figure 1.16.

Very often, what we care about in DNA is the sequence. So, often we’ll write the sequence of the DNA as in Figure 1.17. We won’t bother drawing the DNA or making a line at all. We’ll simply represent the sequence. Again, by convention, we write the sequence of a single strand. Because the information is redundant, we don’t need to write the information of both strands. We can always infer the information of the second strand from the information of the first strand. We always write the single strand with the 5 prime end on the left and the 3 prime end on the right.
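Inferring the second strand from the first can be sketched in Python. Because the strands are anti-parallel and both are written 5 prime to 3 prime by convention, the second strand is the complement of the first read in the opposite direction (often called the reverse complement). The function name and the example sequence are chosen here for illustration.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Given one strand written 5'->3', return its partner written 5'->3'."""
    return "".join(PAIRS[base] for base in reversed(strand))


first_strand = "GATTACA"  # hypothetical sequence
second_strand = reverse_complement(first_strand)
print(second_strand)
```

Applying the function twice returns the original strand, which is another way of seeing that the two strands carry the same information.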

Sometimes there are circumstances when we actually want to consider both strands of DNA. For instance, in Module 2, when we’re talking about mutations, this will be important.

As an example of a simple representation of DNA we refer to Figure 1.13, where the transcription of DNA to RNA is shown. A lot of information is conveyed in a very simple set of lines.

Figure 1.17: The sequence of bases in one strand of a DNA molecule.


1.C.2 DNA is coiled and folded

There’s another level of representation that’s also extremely inaccurate: we draw the DNA as if it were stretched out neatly. But it’s not. DNA molecules are far, far longer than the cells they’re in. Taken together, the chromosomes in any one cell are about a meter long, whereas the cell is about one hundred-thousandth of a meter across: $\frac{1}{100{,}000}\,\text{m} = 10^{-5}\,\text{m} = 10^{-2}\,\text{mm} = 10\,\mu\text{m}$. Consequently, the DNA has to be very tightly coiled to fit in the cell. And it’s not just scrunched together at random. It’s carefully packaged.
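To get a feel for the scale, the two lengths quoted above can be compared directly. This is a small Python sketch of that arithmetic only, using the round figures from the text and working in micrometres to keep everything in whole numbers; the variable names are chosen here.

```python
DNA_LENGTH_UM = 1_000_000  # about one meter of DNA, expressed in micrometres
CELL_WIDTH_UM = 10         # a cell about one hundred-thousandth of a meter across

# How many times longer the DNA is than the cell is wide.
length_ratio = DNA_LENGTH_UM // CELL_WIDTH_UM
print(length_ratio)
```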

Figure 1.18a shows what we’d call naked DNA. This is rare in the cell, except at places where the DNA is being directly copied. Almost all the DNA in the cell is initially compacted by being wound around proteins as in Figure 1.18b. This structure is often described as beads on a string. Those beads with DNA wound around them are themselves wound around each other to make a thick fibre as in Figure 1.18c. All of the inactive genes and the non-functional parts of the chromosomes are in this structure, or in even more compacted forms, as shown in Figure 1.18d, in all of our cells, except when the cells are replicating.

When the cells are actually dividing, the chromosomes need to be made even smaller, so that the length of the chromosome is less than the width of the cell. Then those chromosomes are compacted even more by taking these fibres and, again, winding them up around each other, which is shown in Figure 1.19.

(a) Naked DNA, which is rare.

(b) DNA is wound around proteins, which gives a ‘beads on a string’ structure.

(c) The string with beads winds around itself to produce a thick fibre.

(d) Compact form of DNA.

Figure 1.18: Representations of DNA molecules

Figure 1.19: When a cell divides, the chromosomes become even more compact.

Lecture 1.D History of DNA

Outline

We’re going to talk about the history of DNA. We’re going to start in Section 1.D.1 with DNA replication, the very short-term history of DNA. Then we’re going to draw a parallel between the cell theory in biology (Section 1.D.2) and what we could call a DNA theory for the continuity of genetic information (Section 1.D.3).

1.D.1 DNA replication

In Figure 1.20 you can see a diagram showing DNA replication with many of the components shown. This is simpler than the molecularly correct animations that you can find on-line, e.g. at http://www.hhmi.org.

In Figure 1.21 an even simpler representation is shown. The curved lines are DNA: the two strands of the double helix being replicated, separating to form single strands that can then be copied by the process of DNA replication. What’s not shown here is who’s doing the work. It’s the enzyme DNA polymerase, which is actually a very complex factory of proteins which, if we drew it in, would completely obscure what was happening to the DNA.

DNA polymerase has bound to the DNA and has separated the strands and is then inserting new subunits in the single-stranded region, elongating the new DNA strands, adding new bases, each base’s identity determined by base pairing. Because of the polarity of DNA, the fact that the strands are anti-parallel, one new strand is being made in one direction and the other new strand is made in the opposite direction.

Figure 1.20: Schematic representation of DNA replication.

Figure 1.21: Representation of DNA replication without DNA polymerase.

Figure 1.22: Simple representation of DNA replication. The genetic information in the two double-stranded daughter molecules is identical to the genetic information in the double-stranded parent molecule.

In Figure 1.22 I just show the two strands of DNA, base paired and coming apart under the influence of DNA polymerase (the grey shape). DNA polymerase is then synthesizing new DNA by inserting new subunits, checking the base pair for complementarity to the existing base. The same thing is happening on the other side, but in the opposite direction.

The most important thing about this process is that it’s directed by base pairing. The base that’s inserted in the top strand, the new, red T base, is paired with the blue A base. It is identical to the blue base that’s present in the bottom strand at the same position. The consequence of this for the relationship between the original parent molecule and the two new double-stranded DNA molecules, which we call the daughter molecules, is that the genetic information in the two daughter molecules is identical to the genetic information in the parent molecule. Each daughter molecule consists of one new strand and one old strand physically, but genetically, they are both identical to the original double-stranded DNA.
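The outcome of replication described above can be mimicked with a toy model. This is a sketch, not a simulation of DNA polymerase: a double-stranded molecule is represented as a pair of strings written so that position i of the top strand pairs with position i of the bottom strand, and each new strand is built purely by base pairing against an old one. The names `complement` and `replicate` and the example sequence are chosen here for illustration.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Build the partner strand, position by position, by base pairing."""
    return "".join(PAIRS[base] for base in strand)


def replicate(duplex: tuple[str, str]) -> tuple[tuple[str, str], tuple[str, str]]:
    """Separate the two old strands and pair each with a newly made strand."""
    old_top, old_bottom = duplex
    daughter_1 = (old_top, complement(old_top))        # old top, new bottom
    daughter_2 = (complement(old_bottom), old_bottom)  # new top, old bottom
    return daughter_1, daughter_2


parent = ("ATGCA", "TACGT")  # hypothetical parent molecule
daughter_1, daughter_2 = replicate(parent)
print(daughter_1 == parent, daughter_2 == parent)
```

Each daughter keeps one old strand and gains one new strand, yet both carry exactly the sequence information of the parent.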

1.D.2 The Cell Theory in biology

Let us now think about this in the bigger context. For biologists, the bigger context is the evolutionary continuity of life ever since the origin of life. Figure 1.23 is the big tree of life that shows the relationships of all of the organisms that are alive now: all the bacteria, the bacteria-like cells called archaea, and the eukaryotes, which include all the plants and animals. All of these organisms are descended from earlier organisms that, in turn, are descended from a common ancestor that’s the ancestor of all life.

Figure 1.23: The tree of life

This picture is supplemented by something called the cell theory. It is a theory that was developed around the 1850s, when good microscopes became available and it was possible to see how cells divided. At that time it was realized, and popularized by a German named Virchow, that all cells come from existing cells by cell division. Our bodies do not create new cells from scratch when they want them. Instead, all of the cells in our bodies arise by cell division from cells that are already present, all the way back to the original fertilized egg that formed when the egg and sperm fused.

This means that not just life but cells themselves can be traced all the way back: every cell of every bacterium, of every archaean and of every eukaryote is a direct lineal descendant of cells that were present earlier in evolutionary time, all the way back to the very first cell. So cells aren’t created from scratch; they all arise by division of existing cells.

1.D.3 DNA’s evolutionary continuity

This is also true for DNA. As far as we know, every DNA strand originates by DNA replication, using a pre-existing strand as a template. Consequently, the tree of life (Figure 1.23) is also a tree showing the evolutionary continuity of DNA. All of the DNA in the bacteria, all the DNA in the archaea, all the DNA in eukaryotes, all the DNA in our bodies are direct lineal descendants of DNA that was present in earlier organisms, in earlier cells, all the way back to the very first DNA molecules at the origin of life. And because DNA, unlike cells, is also an informational molecule, this means that DNA contains a lot of information about its evolutionary history.


Summary

We’ve talked about DNA replication, about how each strand in an existing molecule acts as the template for creating a new strand that’s complementary. The result is that each new double-stranded DNA molecule has one old strand and one new strand physically, but genetically is identical to the parent molecule. DNA thus has evolutionary continuity. DNA isn’t created from scratch. Every DNA molecule arose by replication of existing DNA. This means that DNA’s information is, in some ways, a record of evolutionary changes in DNA, going back to the origin of life. And because DNA is the hereditary material that specifies all the properties of life, this makes it an enormously valuable resource for understanding our evolutionary history.

Lecture 1.E What makes some DNA sequences genes?

Outline

In this lecture, we’re going to talk about what makes some DNA sequences within our genome genes, which function in producing everything we need to be the organisms we are, whereas the rest of our genome is basically inert. We’ll talk very briefly about what RNA and proteins are in Section 1.E.1. Then we’ll talk about genes as informational entities in DNA (Section 1.E.2) and the informational processes by which the DNA sequence in a gene becomes usable information that results in a functional protein (Section 1.E.3).

1.E.1 RNA and protein

RNA (ribonucleic acid) is, like DNA (deoxyribonucleic acid), a nucleic acid. It is made of subunits that are very similar to the subunits in DNA. The backbone is slightly different. The backbone of DNA is deoxyribose. The backbone of RNA is a molecule called ribose. As the name indicates, the only difference is that there’s an oxygen on ribose that isn’t in deoxyribose.

One other difference between DNA and RNA is that RNA uses one different base. It still uses A and G and C, but instead of T, it uses the base uracil, a U. So, at positions that in DNA would be a T, the corresponding position in RNA would be a U. The base U, like T, pairs with A.
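The T-to-U substitution can be written as a one-line Python sketch. The function name and the example sequence are hypothetical, and the function only rewrites a DNA sequence in the RNA alphabet; it says nothing about which strand is copied.

```python
def dna_letters_to_rna(dna: str) -> str:
    """Rewrite a DNA sequence in the RNA alphabet: every T becomes a U."""
    return dna.upper().replace("T", "U")

# Hypothetical example sequence.
print(dna_letters_to_rna("ATGACGTAA"))
```

A, G and C are left unchanged, which matches the statement that RNA differs from DNA in only one base.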

RNA is not usually base paired; it is single stranded. However, it gets transiently base paired with DNA when it’s being synthesized. It can also fold up on itself: different parts of an RNA molecule can base pair with each other if the bases are complementary. It sometimes forms transient base pairs with other RNA molecules, in particular, with short segments in the transfer RNA molecules that actually decode the sequence for protein synthesis.

In Figure 1.24 you can see a comparison of the DNA and RNA molecules.


Figure 1.24: A comparison of the RNA and DNA molecule.

Protein is a very different kind of molecule. A drawing is given in Figure 1.25. It’s not a nucleic acid at all. It’s still a polymer. It’s an informational polymer. It consists of subunits that are similar in their ability to form a chain but have different properties. These subunits are called amino acids.

Figure 1.25: A protein molecule is a polymer of amino acids.


Here is an overview of the properties of RNA and protein.

RNA:

• a nucleic acid, like DNA,

• slightly different backbone than DNA (ribose, not deoxyribose),

• U bases where the corresponding DNA bases are T bases,

• usually single stranded but can fold up by internal base pairing.

Protein:

• not a nucleic acid,

• a polymer of amino acids.

We’ll talk a lot more about proteins in Module 3. For now, all you need to know is that the proteins are the enzymes and the structures, almost all of the working parts of the cell.

1.E.2 Genes are information in DNA

Most of our genome isn’t genes. Genes are only a small subset of our DNA. If you just looked at any segment of your DNA, you’d have no way to tell whether or not it encoded a gene. A piece of DNA just looks like DNA. Unless you examine and analyse the sequence, you can’t tell that it is a gene. Genes are informational entities, not physically distinct from the rest of the genome.

The informational properties that make them genes are, first, that they have signals called the promoter and the terminator, which are short sequences that direct proteins to carry out the process of transcription, to make an RNA copy of this part of the DNA. Typically, they also have, within the sequence between the promoter and terminator, sequences that specify particular amino acids for translation, for the formation of protein, including signal sequences that say, make this into a protein.

By far the majority of genes encode protein. Some genes, however, encode functional RNAs. These are RNAs that don’t serve to make protein, but are functional in their own right. Most of these are enzymatic components of the protein synthesis machinery, the ribosome. Or they are the adapter molecules that decode the RNA sequence and connect it to the amino acids that are going to be inserted.


1.E.3 DNA information becomes protein information by transcription and translation

In Figure 1.26 is a diagram to give you some sense of how the regulatory signal sequences, the promoter and terminator, act. Each of these signals is a short sequence of bases. In the diagram you see two lines representing double stranded DNA, with the cross hatch marks to help remind us that these two lines represent DNA. The blue oval represents RNA polymerase. That’s the enzyme that’s going to carry out transcription: it’s going to synthesize the RNA using a DNA template. It recognizes two signals on the DNA.

First, it recognizes a sequence that’s effectively a start sequence. It is called the promoter. The promoter tells RNA polymerase “start here to make RNA.” RNA polymerase then proceeds along the DNA and as it does, it makes an RNA copy of the DNA. It stops when it comes to a sequence called the terminator, which is another short regulatory sequence that tells RNA polymerase “this is the place to stop making RNA, to disassociate from the DNA, and to release the RNA from the DNA.” So, the RNA is not connected physically to the DNA once it’s finished.

When we go back to Figure 1.13, we have the two strands of DNA, shown base paired at the left and right. It shows that in fact the two strands come apart. They’re unzipped by RNA polymerase, so it can make a complementary RNA to one strand.

This also lets me point out that the promoter has a second function. In addition to telling RNA polymerase where to start on the DNA, it also tells it which direction to go. It does this basically by telling RNA polymerase which strand to use. If RNA polymerase uses the bottom strand in the drawing, it has to go to the right because RNA is synthesized from its five prime end to its three prime end on a DNA strand that runs three prime to five prime. If RNA polymerase were to use the top strand, it would have to be going left. Consequently, the function of the promoter is not just to say start here, but to also say, either start here and use the bottom strand so you’re going right, or start here and use the top strand so you’re going left.
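The link between strand choice and direction can be sketched in Python. Everything here is an illustrative assumption: the nine-base sequence is invented, the top strand is taken to be written 5' to 3' from left to right (so the bottom strand runs 3' to 5' from left to right), and promoters and terminators are left out entirely.

```python
# Pairing between a DNA template base and the RNA base laid down opposite it.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5: str) -> str:
    """Read a template strand written 3'->5' and return the RNA written 5'->3'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)

# Hypothetical double-stranded DNA, both strands written left to right.
top = "ATGACGTAA"     # 5' -> 3'
bottom = "TACTGCATT"  # 3' -> 5'

# Bottom strand as template: it already reads 3'->5' left to right,
# so the polymerase moves to the right.
rna_rightward = transcribe(bottom)

# Top strand as template: it reads 3'->5' only from right to left,
# so the polymerase has to move to the left.
rna_leftward = transcribe(top[::-1])

print(rna_rightward)
print(rna_leftward)
```

The two RNAs are different molecules, which is why the promoter has to specify the strand as well as the starting point.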

Let’s add on to this process the signals that control protein synthesis. We now have two more signals: another start signal, the start codon, and another stop signal, the stop codon, as shown in Figure 1.27.

Figure 1.26: DNA is transcribed into RNA by RNA polymerase. RNA polymerase recognizes two signals on the DNA: the start sequence, called promoter, and the terminator, where it has to stop the synthesis.

Figure 1.27: mRNA is translated into protein.

Each protein coding gene has two ‘start here’ signals. It has the promoter, which is a signal in the DNA. And it has the start codon, which is a signal in the RNA, but in fact that is coded also in the DNA. It’s recognized in the RNA by the ribosome, the protein and RNA factory that will synthesize the protein. There are also two kinds of stop signals. There’s the terminator that we already introduced, that tells RNA polymerase where to stop. And there’s a signal in the RNA: the stop codon. It acts in the RNA and is recognized by the ribosome. The stop codon tells the ribosome to stop here, stop making protein. Again, this sequence is specified in the DNA, but it acts in the RNA. What happens is this:

1. The ribosome binds to the start codon,

2. it proceeds along the messenger RNA from its five prime end to its three prime end,

3. it assembles amino acids into the polymer of a protein,

4. it stops when it reaches the stop codon.

The order of the bases within the DNA determines of course the order of the bases in the messenger RNA, and that determines the order of the amino acids in the protein.

Summary

Genes are informational entities in our DNA. Sometimes they specify functional RNAs, like parts of the ribosome, but most of them specify messenger RNAs that code for proteins. Genes are identified by the cell by the presence of regulatory signals, short sequences in the DNA, that tell RNA polymerase to make an RNA copy and short sequences that act in the RNA to tell the ribosome where to start and stop to make protein.

Lecture 1.F Coding for proteins

Outline

In this lecture we’re going to be talking about how the information in DNA is used to code for proteins. Coding is really very much the right word for what happens. We’ll talk in Section 1.F.1 about how messenger RNA must be decoded in protein synthesis, from the language of the bases, the nucleotide subunits of DNA, into the language of amino acids, the subunits of protein.

The genetic code isn’t the DNA sequence. The genetic code is the codebook that explains how the translation from one language to another is going to happen, just like any other codebook or a foreign language dictionary. This is discussed in Section 1.F.2.

The actual translation is done by molecules called transfer RNAs, as explained in Section 1.F.3.

In Section 1.F.4, finally, we will discuss the key concept of reading frames.

1.F.1 mRNA must be decoded in protein synthesis

Figure 1.28: Part of an mRNA split into codons and the amino acids they specify.

We’ve said several times that the order of the bases in the messenger RNA specifies which amino acids are going to be joined to make the protein. The genetic code is the specification, the connection between bases and amino acids. When translating, the messenger RNA is read in groups of three bases, called codons. In Figure 1.28 we have part of a messenger RNA. It has been split into groups of three bases which are the codons. Each of these codons corresponds to a particular amino acid: AUG with methionine, ACG with threonine, et cetera.

Remember, we’ve used the term codon before, when we talked about start codons and stop codons. They are also groups of three bases. The start codon is AUG. Consequently, the start codon always specifies the amino acid methionine. There are three stop codons: UAA, UAG and UGA. They don’t specify any amino acid at all, and that’s why translation stops at a stop codon.

1.F.2 The ‘Genetic Code’ is the codebook

Figure 1.29: The codebook for the translation of mRNA codons into amino acids

In Figure 1.29 is an example of a typical genetic code table. It shows which amino acid each codon codes for. The third row in the table, for example, shows all the codons that start with A. The second column shows all the codons whose second letter is C. The (3, 2) box therefore holds all the codons that start with AC. With any of the four third-position bases (A, C, G or U), they all specify the same amino acid, threonine. It is not always the case that the base in the third position doesn’t matter, but it is in a number of cases.
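For readers who like to see the codebook as data, here is a minimal Python sketch that builds the standard genetic code as a dictionary. It assumes the row and column order U, C, A, G, which matches the description above (third row starts with A, second column has C in the middle), and it writes amino acids with one-letter codes and stop codons as `*`. The variable names are chosen for this sketch only.

```python
from itertools import product

BASES = "UCAG"  # row, column and third-position order of the table
# One-letter amino acid codes for the 64 codons in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Pair each codon (UUU, UUC, UUA, UUG, UCU, ...) with its amino acid.
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# The (3, 2) box: first base A, second base C, any third base.
box_3_2 = {codon: GENETIC_CODE[codon] for codon in ("ACU", "ACC", "ACA", "ACG")}
stop_codons = sorted(c for c, aa in GENETIC_CODE.items() if aa == "*")

print(box_3_2)      # all four map to 'T', threonine
print(stop_codons)  # the three stop codons
```

In the one-letter notation, M is methionine, T is threonine and E is glutamate.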

1.F.3 Transfer RNAs translate the code

Cells don’t use a genetic code table. The table in Figure 1.29 is just something that geneticists have come up with. In the cell, the interpretation of the code is done by transfer RNA molecules. In Figure 1.30 is a typical transfer RNA: transfer RNA glutamate, whose name we’ll write as tRNAGlu. It has a glutamate amino acid attached to one end.

Remember that RNA molecules, such as tRNAGlu, can fold up and different parts of them can form base pairs with other parts. That’s how what starts as a stretched out linear RNA folds into this complicated structure of Figure 1.30. The key feature of the structure, apart from its particular amino acid, is the presence of a set of three unpaired bases called the anticodon, which are complementary to the codon for glutamate.


Figure 1.30: Transfer RNA glutamate (tRNAGlu) has the amino acid glutamate attached to one end and the anticodon CUC at the other end. The anticodon CUC is complementary to the codon for glutamate, namely GAG.

The transfer RNA brings the glutamate to the messenger RNA inside the complicated structure that I’ve been referring to as the ribosome, which is the big protein synthesis factory. And in that factory, glutamate will be added to the growing chain of amino acids specified by the chain of bases, the messenger RNA.

In Figure 1.31 is a second drawing showing a simplified transfer RNA. For each amino acid, there’s a different transfer RNA with the appropriate anticodon. The first one is the methionine transfer RNA with its anticodon, then we have the threonine transfer RNA bringing threonine to base pair with the complement of the three bases UGC.

Figure 1.31: The transfer RNAs that base pair with the messenger RNA codons and bring the amino acids to form the protein.

This is how the genetic code is used to translate the base sequence of a messenger RNA, copied from the base sequence of a gene, into the amino acid sequence of a protein.
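The codon and anticodon relationship in this section can be checked with a short Python sketch. The function name is hypothetical, and the anticodon is written in pairing order, that is, base by base opposite the codon, which is how CUC and UGC are written in these notes.

```python
# Base pairing rules between two RNA strands.
RNA_PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def anticodon_for(codon: str) -> str:
    """Return the anticodon that pairs with `codon`, position by position."""
    return "".join(RNA_PAIR[base] for base in codon)

print(anticodon_for("GAG"))  # glutamate codon
print(anticodon_for("ACG"))  # threonine codon
print(anticodon_for("AUG"))  # methionine (start) codon
```

The first two results reproduce the anticodons CUC and UGC mentioned for the glutamate and threonine transfer RNAs.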

1.F.4 Reading frames

A key concept for thinking about how genes specify proteins is that of reading frames. In principle, in a DNA sequence, every three base sequence is a potential codon.

Figure 1.32: A DNA sequence has been marked off in groups of three bases.

In Figure 1.32 I’ve marked off the DNA sequence in groups of three which are all potential codons if this happened to be a gene. Every ATG is a potential start codon and every TAA, TAG or TGA is a potential stop codon. But in fact, of course, there are three ways to read this sequence depending on which base you start at. You could be reading in one of the three potential reading frames shown in Figure 1.33.

Figure 1.33: Three potential reading frames in one direction

Going in the other direction on the bottom DNA strand, we have another three reading frames. So, in total there are six possible reading frames, as shown in Figure 1.34.
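To make the count of six concrete, here is a minimal Python sketch that lists all reading frames of a short, made-up DNA sequence. The function names are assumptions for this sketch. The top strand is taken to be written 5' to 3', and the bottom strand is obtained as its reverse complement so that it can also be read 5' to 3'.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5'->3'."""
    return "".join(PAIR[base] for base in reversed(dna))

def codons(strand: str, offset: int) -> list:
    """Mark off complete groups of three bases, starting at `offset`."""
    return [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]

def six_frames(top: str) -> dict:
    """Three frames on the top strand plus three on the bottom strand."""
    frames = {}
    for name, strand in (("top", top), ("bottom", reverse_complement(top))):
        for offset in range(3):
            frames[(name, offset)] = codons(strand, offset)
    return frames

# Hypothetical ten-base sequence.
frames = six_frames("ATGACGTAAC")
for key, value in frames.items():
    print(key, value)
```

Leftover bases at the end of a frame that do not fill a complete group of three are simply dropped.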

I’ll show you in Module 2 how geneticists analyse reading frames, but the cell doesn’t get confused by this and that’s because the cell doesn’t translate DNA, the cell translates messenger RNA. The only things that are considered by the ribosome are sequences that have a promoter so that they can be transcribed, and then they are potentially translatable.

Figure 1.34: There are six possible reading frames, three in each direction.

Figure 1.35: A messenger RNA sequence, transcribed from its DNA.

In Figure 1.35 is a messenger RNA that’s been transcribed from the DNA sequence in Figures 1.32 and 1.33. This already eliminates three reading frames from consideration because we’ve only got one strand, which means that the ribosome will have to move to the right (from 5’ to 3’). Once a ribosome encounters an RNA that could be a messenger RNA, what it looks for is a short sequence called a ribosome binding site, which is always very close to the start codon of the gene. The ribosome binds at the ribosome binding site. It then moves along the RNA until it encounters the first AUG and that AUG sets the reading frame. From that AUG on, the sequence is read in groups of three.

In these groups of three, any other AUGs are just treated as methionine codons. They’re not treated as start points at all. Furthermore, stop codons are recognized only if they’re in the same reading frame as the AUG that started synthesis. Any stop codons that are out-of-frame are ignored. In fact, any out-of-frame combinations of any kind are ignored by the ribosome. It only sees the reading frame that’s set by the AUG that it started with.
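The reading rules just described can be sketched in Python as follows. This is a simplification under stated assumptions: the ribosome binding site is ignored and the first AUG in the string is taken as the start, the messenger RNA is invented for the example, and amino acids are written with one-letter codes using the standard genetic code in U, C, A, G order.

```python
from itertools import product

# Standard genetic code, codons in UCAG order; '*' marks a stop codon.
CODE = dict(zip(
    ("".join(c) for c in product("UCAG", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

def translate(mrna: str) -> str:
    """Start at the first AUG, read in threes, stop at the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # in-frame stop codon: translation ends here
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical mRNA: leader GGC, then AUG ACU GAG AUG UAA, then GC.
# It contains out-of-frame UGA triplets, which are never read as codons.
print(translate("GGCAUGACUGAGAUGUAAGC"))
```

The second AUG is read as an ordinary methionine codon, and the out-of-frame UGA triplets inside ACUGAG have no effect.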

Summary

We’ve talked very much in the language of information. We’ve talked about coding, we’ve talked about reading, we’ve talked about translating codons or words. So really, literally, genes do encode proteins going from the language of nucleotides to the language of amino acids. The code is read in words that are groups of three bases. It’s as if in our language all the words were three letters long.

The genetic code is the codebook that translates the language. We’ll talk later about how geneticists use the codebook, but to a cell, the codebook is physically instantiated in the transfer RNA molecules that bring the amino acids to the codons of the messenger RNA in the ribosome.

We’ve talked about reading frames and how, although there are many potential reading frames, the ribosome knows which reading frame to consider because it’s only looking at a messenger RNA, not both strands of a gene, and it uses a ribosome binding site. The first start codon, the first AUG in the messenger RNA, tells it where to start. This sets the reading frame and determines how the messenger RNA will be translated.


Lecture 1.G More about genes

Outline

In this lecture, we will talk about some features of genes that we haven’t discussed yet. In particular we’ll talk in Section 1.G.1 about the very peculiar phenomenon of introns and splicing, and we’ll talk in Section 1.G.2 about how cells and particularly how geneticists identify genes.

1.G.1 Introns and splicing

For me this is an embarrassing topic, because it so strongly contradicts my pleasure in the beauty and elegance of molecular biology, because this is a messy and apparently unnecessary phenomenon that just makes life more complicated for cells.

Most protein-coding genes include segments that aren’t protein-coding at all. They’re called introns, and they have to be cut out – spliced out is the term we use – of the RNA before it’s a functional messenger RNA and it is ready to be translated into protein. The segments that are kept are called exons.

Figure 1.36: A typical gene with exons and introns.

In Figure 1.36 is a diagram of a fairly typical gene. The line represents the DNA molecule. You can see the regulatory signals for transcription: the green arrow is the promoter and the orange rectangle is the terminator. What’s transcribed into RNA is also indicated, but only some of this codes for protein. In particular, there are long segments called introns that don’t code for protein at all, and they have to be cut out and discarded. There are segments that do code for proteins (the exons), which are typically shorter, and they have to be joined together when the introns are cut out.

There are also two segments at the beginning and the end that usually are not translated, but are a part of the mature messenger RNA: there’s a short segment at the beginning which usually includes the ribosome binding site before the start codon, and a short segment at the end which will include the stop codon and a little bit of sequence after it that is not going to be translated. They are drawn in red in Figure 1.36.


Figure 1.37: How splicing happens.

In Figure 1.37 is a schematic of how splicing happens. We have a gene with a promoter (blue arrow) and a terminator (red lines). The RNA includes segments that are going to be removed, the introns. The introns are recognized by the splicing machinery because of particular regulatory sequences, recognition sequences at the beginning and end, the junction points or splice points, of the introns, indicated by the green short lines. This allows the cell to recognize the places where it needs to cut out sequences.

The sequences that code for protein, the exons, are joined together into a mature messenger RNA. And the intervening sequences are discarded.
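Purely as an informational picture of the outcome of splicing, here is a minimal Python sketch. It assumes the intron positions are already known and handed in as index pairs, which sidesteps how the cell recognizes the junctions. The function name, the coordinates and the sequence are all invented, and the intron is written in lowercase only to make it visible.

```python
def splice(pre_mrna: str, introns: list) -> str:
    """Cut out the introns and join the remaining exons.

    `introns` is a sorted list of non-overlapping (start, end) index pairs,
    where `start` is the first intron base and `end` is one past the last.
    """
    exons = []
    position = 0
    for start, end in introns:
        exons.append(pre_mrna[position:start])  # keep the exon before this intron
        position = end                          # skip over the intron itself
    exons.append(pre_mrna[position:])           # keep the final exon
    return "".join(exons)

# Hypothetical transcript: exon AUGACG, intron gucccag, exon GAGUAA.
pre_mrna = "AUGACGgucccagGAGUAA"
print(splice(pre_mrna, [(6, 13)]))
```

The result is the continuous coding message AUGACGGAGUAA, with the intervening sequence discarded.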

In the context of this course, you don’t need to know much about introns at all. You need to know that they exist. They’re very important for understanding the genome. But we’re not going to discuss them in any detail, and you don’t need to know anything about how splicing works.

In Figure 1.38 is the structure of a generic gene. This particular generic gene has only got a single intron, but it’s got all of the regulatory sequences that you need to think about. The regulatory sequences where transcription factors bind tell RNA polymerase where to look for a promoter. We see the promoter, the start codon, the ribosome binding site, codons and an intron with its regulatory junctions, the stop codon and the transcriptional terminator site.

Natural selection has acted on all of these sequences to optimize their function. It’s acted on coding sequences to optimize the combination of amino acids, even the order that the amino acids are in the chain to give the best function for the protein. It’s acted on the introns, on the splice junctions for correct excision, it’s acted on all of the regulatory sequences, the strength of the promoter, which transcription factors bind, to optimize when the gene is expressed, and how strongly it’s expressed.

Figure 1.38: The structure of a generic gene


1.G.2 How cells and geneticists identify genes

How cells identify genes

We’ve already talked about how cells identify genes. Here is an overview. (See also Figure 1.38).

1. Regulatory proteins recognize and bind to DNA near the promoter.

2. RNA polymerase binds at the promoter and initiates transcription.

3. RNA polymerase makes an RNA version of one strand of the DNA.

4. RNA polymerase and other proteins recognize a terminator sequence and release the new RNA.

5. Introns are spliced out due to recognition sequences in the intron.

6. The ribosome binding site and start codon direct translation.

How geneticists identify genes

How cells identify genes is nothing like how geneticists identify genes. We use ways that seem more sophisticated to us, but really the cell is much better at it than we are.

Geneticists really have two fundamentally different ways of identifying genes:

1. genetic analysis,

2. sequence analysis.

Genetic analysis started with the very first geneticist, Gregor Mendel, who used crosses between pea plants with different phenotypes to investigate the patterns of inheritance. What did the progeny look like? How many of each kind were there? From that he inferred the existence of genes controlling the properties he was studying. Many, many geneticists in the 150 years since then have used the same strategy to identify a great deal about what genes there are and how they work.

Sequence analysis uses DNA sequences, to which we now have access in enormous quantities relatively cheaply. That means that it’s possible to do a lot of analysis without any crosses. In many cases, we supplement or complement the genetic analysis, the crosses, with sequence analysis where we use computers to analyse DNA sequences and to identify the genes that they encode.

To think about this, we need to first build on the concept of reading frames that we developed in the last lecture. We talked about how there are six reading frames in any double stranded DNA, three frames going to the right, three frames going to the left. And we can mark off the three base codons in three different ways, depending on where we start. To a computer, to a molecular geneticist, an open reading frame is a section of the DNA that starts with an ATG or an AUG, depending on whether we’re looking at a DNA sequence or an RNA sequence. So the segment starts at a start codon and, read in threes, ends with a stop codon that’s in the same reading frame.

Figure 1.39: A DNA sequence with all possible reading frames. The yellow area is an open reading frame in one of them. It starts with the start codon ATG and ends with the first stop codon in the reading frame, TAG.

The start point is set by the first start codon and the open reading frame extends all the way to the first stop codon that’s actually in frame. This is illustrated in Figure 1.39, where the yellow area is an open reading frame.
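A computer search for open reading frames along these lines can be sketched in a few lines of Python. This is an illustrative sketch, not the software geneticists actually use: the function names and the sixteen-base sequence are made up, the top strand is assumed to be written 5' to 3', and an open reading frame that runs off the end of the sequence without reaching a stop codon is simply not reported.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5'->3'."""
    return "".join(PAIR[base] for base in reversed(dna))

def orfs_in_strand(strand: str) -> list:
    """Find ATG ... in-frame stop segments in the three frames of one strand."""
    found = []
    for offset in range(3):
        start = None  # index of the ATG that opened the current reading frame
        for i in range(offset, len(strand) - 2, 3):
            codon = strand[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                found.append(strand[start:i + 3])  # include the stop codon
                start = None
    return found

def find_orfs(top: str) -> dict:
    """Search all six reading frames of a double-stranded DNA."""
    return {
        "top": orfs_in_strand(top),
        "bottom": orfs_in_strand(reverse_complement(top)),
    }

# Hypothetical sequence with one complete open reading frame on the top strand.
print(find_orfs("CCATGACGGAGTAGCC"))
```

In this example the bottom strand does contain an ATG, but no in-frame stop codon follows it, so nothing is reported for that strand.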

Finding open reading frames is only the very first step in finding genes with a computer. Many open reading frames are not genes at all, so we use additional features to decide if we’re looking at something that might actually be a gene.

Figure 1.40: A DNA sequence translated by a computer in all its possible reading frames, written as amino acids.

In Figure 1.40 is an example of a DNA sequence that has been translated by a computer in all of its possible reading frames, written as amino acids. The three reading frames at the top are going to the right, the three reading frames below the DNA sequence are going to the left. In each case, the open reading frames have been marked off by blue highlighting.

For example, at the bottom line in Figure 1.40 there is a short open reading frame - L C L I M read from right to left, from methionine to a stop codon, indicated by the dash. On the second line is a relatively long open reading frame M I S R G K E S Y K M S D K L K G N N Y E S D. Going from left to right, it starts from methionine and we don’t know where it ends.

First we look for long open reading frames. Usually we set the computer to ask that they be greater than 50 or 100 amino acids. Then the computer checks for sequences that resemble promoter and terminator sequences and checks for similarity to known genes. Because of the way genes arise, which I’m going to discuss in one of the following lectures, looking for similarity to known genes is a very powerful way to identify genes.


Summary

We’ve talked about how gene expression is controlled by regulatory sequences. In particular, we talked about introns and how regulatory sequences at the junctions allow them to be spliced out of messenger RNA and discarded. We talked about how real and potential genes are recognized by the cell, and in particular we talked about how geneticists recognize genes, both in organisms through crosses and in DNA sequences through sequence analysis.

Lecture 1.H What makes these processes so confusing?

Outline

This lecture is made in recognition of the fact that the three processes we’ve been talking about – DNA replication, transcription, and translation – are really intrinsically very confusing, and you have to work hard to keep everything straight. This lecture is an attempt to clarify things by pointing out the common features of all three processes that make them so easy to confuse, and the distinguishing features that make it so important to keep them straight.

We will start in Section 1.H.1 with common features and distinct functions of DNA replication, transcription and translation. In Section 1.H.2 I will give a text analogy and in Section 1.H.3 we will show that the processes are hard to distinguish because they are so very similar.

1.H.1 Common features of DNA replication, transcription and translation and their different functions

The three processes we are talking about have many parallels and confusingly similar terminologies, but very different functions.

DNA replication   DNA info → DNA info
transcription     DNA info → RNA info
translation       RNA info → protein info

Table 1.1: An overview of the main difference between three important processes in genetics: DNA replication, transcription and translation.

In Table 1.1 I give an overview of the three processes. DNA replication makes a DNA copy of the DNA. Transcription takes the DNA information and turns it into an RNA version. These are still nucleotides, just slightly different ones, with an extra oxygen and with a U instead of a T. Translation takes the information in RNA and turns it into information in protein.


Process           Product                               Why?
DNA replication   DNA copy of the DNA                   To make more cells. DNA replication is the central act of heredity. It is here that mutations happen.
transcription     RNA version of a DNA segment          RNA is the intermediate polymer between DNA and proteins. Transcription regulates most gene activity.
translation       protein specified by an RNA segment   Proteins are the cell’s machinery that does everything.

Table 1.2: Overview of the three processes, their product and reason why they happen

Why do these things matter? This is described in the third column of Table 1.2.

• DNA replication matters to make more cells. This is how heredity works.

• Transcription produces the intermediate RNA. For most genes, this is the limiting step that determines whether they get expressed or not. So this is where most of the regulation happens.

• Proteins, the products of translation, are the machines that get just about everything done in the cell and thus, in our bodies.

1.H.2 A text analogy

I’ve made a text analogy of the three processes just to sort of provide one more distinguishing perspective on it. We use the following sentence:

...startstartthefatcatatethebigbadratstopstop...

It says “the fat cat ate the big bad rat.” And it’s flanked by two signals for starting and two signals for stopping. These signals control how the information is used.

DNA replication: In DNA replication none of the signals matter. DNA polymerase does not read any information from the DNA that it’s copying. It just copies it accurately. DNA replication makes a complete copy of the whole string of text that we started with. The result is

...startstartthefatcatatethebigbadratstopstop...


Transcription: Transcription reads the outer signals – the first start signal and the second stop signal – and it makes a copy of the information in between these two signals in another nucleotide language. It’s a variant of the DNA language, like a dialect. The result is

...startthefatcatatethebigbadratstop...

Translation: Translation takes this sequence, uses the “start here” signal and the “stop here” signal, and takes the information in between and translates it into a completely different language. In this case, I used Google Translate to translate it into Korean.
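The same analogy can be played out in a few lines of Python. This is only a toy: the function names are invented, the leading and trailing dots of the sentence are dropped, and since a program cannot stand in for Google Translate, the "different language" here is simply the message split into its three-letter words.

```python
TEXT = "startstartthefatcatatethebigbadratstopstop"

def replicate(text: str) -> str:
    """Replication ignores every signal and copies the whole string."""
    return text

def transcribe(text: str) -> str:
    """Transcription copies what lies between the first start and the last stop."""
    begin = text.find("start") + len("start")
    end = text.rfind("stop")
    return text[begin:end]

def translate(message: str) -> list:
    """Translation reads what lies between the remaining start and stop signals,
    in words of three letters (a stand-in for a truly different language)."""
    begin = message.find("start") + len("start")
    end = message.find("stop", begin)
    body = message[begin:end]
    return [body[i:i + 3] for i in range(0, len(body), 3)]

copy = replicate(TEXT)
message = transcribe(copy)
words = translate(message)
print(copy)
print(message)
print(words)
```

Each step uses a different pair of signals, and only the last step changes the kind of information that comes out.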

1.H.3 Why is it so hard to keep them straight?

Figure 1.41: DNA replication, transcription and translation are very similar. They all start from a polymer of nucleic acids with start and stop sequences. A molecular machine moves along the template, reads the bases and synthesizes a new polymer.

Figure 1.41 shows why it is so hard to keep the three processes straight. It is because they are very similar.

All three processes, DNA replication, transcription and translation, start from a template that is a polymer of nucleic acids. It is the bottom blue line in Figure 1.41. The template has some sequences that say start here, and it has some sequences that say, here’s where you stop.

There’s a molecular machine in every case – in transcription, translation, and DNA replication – which is usually a complex of a large number of highly sophisticated proteins, and sometimes with RNA, as well. This molecular machine is going to move along the template, and it’s going to read the base sequences in the template and use that information to synthesize a new polymer whose sequence of bases or amino acids is specified by the template.

The product is in each case a polymer, indicated by the top blue line. If we are looking at DNA replication, the product is DNA, if it’s transcription, the product is RNA, if it’s translation, the product is a protein. The polymer’s sequence was determined by the sequence of the template. When the molecular machine that carries out this process then reaches the stop signal, it disassociates, and it releases the completed product.

Lecture 1.I What is a chromosome?

Outline

We’re going to talk about the fundamentals of what chromosomes are. We’ll talk about both their physical structure (Section 1.I.1) and their informational content (Section 1.I.2). We’ll talk about the regulatory signals that a DNA molecule needs if it’s to function as a chromosome in Section 1.I.3. In Section 1.I.4 we will briefly discuss human chromosomes. We’ll talk about the different terms that we use for these in Section 1.I.5. And in Section 1.I.6 we’ll talk about how we can represent chromosomes.

1.I.1 One very long DNA molecule

Structurally a chromosome is one very long molecule of DNA. Human chromosome X is 150 million base pairs long, very long. And the DNA isn’t naked in the cell. The DNA is bound to and wrapped around proteins, as we described in Section 1.C.2.

1.I.2 Information for 100s or 1000s of genes

Informationally, a chromosome is one very long DNA sequence. Embedded in this sequence are sequences that specify genes and other functions. They’re embedded in a background of non-functional DNA sequences.

1.I.3 Regulatory signals

If a molecule of DNA is going to function as a chromosome, it has to have particular properties. It has to carry specific information. First, it has to have signals that are recognized by DNA replication proteins. These signals are called origins of DNA replication. There are usually multiple origins of replication along the length of a chromosome. In Figure 1.42 you can see a chromosome with multiple origins, represented by the grey dots.

Figure 1.42: A chromosome with multiple origins, represented by the grey dots. The blue patches at the ends are the telomeres and the green spot is the centromere of the chromosome.


Chromosomes also need special sequences for where DNA replication ends. At the ends of the chromosomes there are special sequences called telomeres. They are indicated in Figure 1.42 by the blue patches at the two ends of the chromosome. These sequences exist because the ends of DNA molecules are harder to replicate than the internal parts of DNA molecules.

Chromosomes also have to have special attachment point sequences, called centromeres, one to each chromosome, located at a particular place on the chromosome. In Figure 1.42 the centromere is coloured green. It is the place that fibres attach to when the cell is going to divide to pull the chromosomes apart.

Finally, of course, chromosomes have to have genes. We’ll talk about the genes on chromosomes in the next lecture.

1.I.4 Human chromosomes

Figure 1.43: The 24 human chromosomes.

Human chromosomes (see Figure 1.43), luckily, are fairly typical. We’re fairly ordinary animals genetically. If you’re female, you have 23 different chromosomes. If you’re male, you actually have 24. Of each of these chromosomes, we have two versions, except for the X and Y chromosome, if you’re male.

Each chromosome has got between 50 and 250 million base pairs of DNA. That’s how long it is. And in that DNA sequence is the information for between about 400 and 4,000 genes, depending on the chromosome. The longer chromosomes have more genes on average. Each chromosome has different genes, not just different versions, but completely different genes.

Most organisms have modest numbers of chromosomes, between, say, 5 and 50 or so. There are some organisms that have all their genes spread over only a very few chromosomes. Some organisms will have their genes spread out over hundreds of tiny chromosomes. These organisms generally have similar amounts of DNA to us. It’s just that they spread their DNA out in more pieces.

1.I.5 Two terms for genes

I have to introduce now some very important terminology that it will be critical that you be able to use clearly. There are two alternative words that geneticists use instead of the word gene: locus and allele. We need these terms because genes come in different versions. The word allele refers to one of the versions of a gene, or a version of a DNA sequence even if it’s not a gene, whereas locus refers to the location, the position on a chromosome where the gene (the sequence that encodes it) occurs. The term locus is used when you want to discuss the gene as a general thing and to include in your discussion all of the versions of the gene. So we could talk about the locus that codes for the ability to taste a particular chemical, for instance. And then we might say that this locus encodes a protein that comes in different versions. This locus has different alleles, different versions of the DNA sequence. So, we use ‘allele’ where we refer to a version and ‘locus’ where we refer to the general gene in all its versions.

Here is an overview of the three terms: gene, locus and allele:

LOCUS: the location of a gene or other DNA sequence on a chromosome (refers to any/all alleles of that gene)

ALLELE: a non-identical version of a gene or, more generally, of a DNA sequence

GENE: usually, a segment of DNA specifying a protein or functional RNA (often used where ‘locus’ or ‘allele’ would be clearer)
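As a loose illustration of the terminology above, a locus can be pictured as a named position that collects the alleles found there. The sketch below is a minimal Python model, not a standard representation; the locus name, the chromosome label and the six-base allele sequences are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Locus:
    """A position on a chromosome; stands for the gene in all its versions."""
    name: str
    chromosome: str
    alleles: set = field(default_factory=set)  # distinct sequence versions seen here


# Hypothetical locus with made-up sequences, for illustration only.
taste_locus = Locus(name="taste locus (hypothetical)", chromosome="chromosome A")

# Three sampled chromosomes carry these versions; two of them are identical.
for observed_sequence in ["ATGCCA", "ATGGCA", "ATGCCA"]:
    taste_locus.alleles.add(observed_sequence)

n_alleles = len(taste_locus.alleles)
print(f"{taste_locus.name} has {n_alleles} alleles")
```

One locus, several alleles: identical versions collapse into a single allele, which is why the set holds two entries although three chromosomes were sampled.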

1.I.6 How we can represent chromosomes

You’ve already seen several representations of chromosomes and I haven’t really explained them very much at all. So I’m going to talk quite a bit about this in the next couple of pages.

First, in Figure 1.44 is a representation showing what chromosomes actually look like under a microscope in a cell that was sort of frozen, stopped in the middle of cell division.

The chromosomes aren’t neatly arranged like this and numbered in the cell. They’re all spread out and in a bit of a mess. What the cytogeneticist who took this photograph has done is actually cut out each chromosome from the picture, and then rearranged them on a piece of paper so that the two versions of chromosome 1 are together, the two versions of chromosome 2 are together, et cetera. Figure 1.44 shows the chromosomes of a male because there is one version of the X and one version of the Y chromosome.

Figure 1.44: The 24 human chromosomes as seen through a microscope.

You notice that the chromosomes are not very informative. They just look like little blobs and they’re darker in some bits than others. These darker and lighter parts reflect different properties of binding to a dye that’s used to color the chromosomes under the microscope. And the dark and light parts have come to be used as landmarks.

Chromosomes are often represented by diagrams that show the dark and light bands as landmarks along the length of the DNA, as in Figure 1.43 and Figure 1.45. It’s important to realize that these dark and light bands do not represent genes. There are many more genes than there are dark and light bands. They’re just staining patterns that serve as landmarks in the same way that the position of the centromere serves as a landmark.

Figure 1.45: The dark and light bands do not represent genes. They are staining patterns and serve as landmarks.

Teachers and students often represent chromosomes as sort of fat, blobby Xs or skinny butterflies, as in Figure 1.46. That’s not how chromosomes look and it’s best that you do not represent them like that because it will actually create confusion in your mind.

Figure 1.46: Chromosomes are not shaped like blobby Xs.

I’m going to represent chromosomes in quite a few different ways, but they’re all going to be very simple. I might show them as sort of rounded lines like in Figure 1.47, with constrictions showing the locations of the centromeres. It’s not that the length or the position of the centromere is that important when I’m talking about it, but because these differ from chromosome to chromosome, they serve as ways to reinforce the point that these are different chromosomes with different genetic information. I’ve also drawn them with different lengths to emphasize that point.

Figure 1.47: One way in which I will represent chromosomes. To emphasize that they are different chromosomes, they have different lengths and different positions of their centromere.

More simply, I could draw chromosomes just as lines as in Figure 1.48. This is especially likely if I’m just drawing them freehand. I might draw a line and I might draw a blob on it to indicate that this is the location of the centromere. I might sometimes just draw the lines without the centromere.

Figure 1.48: These lines represent chromosomes. The dots on the lines indicate the location of the centromere.

I may draw them as wiggly lines (see Figure 1.49), especially if I want to emphasize that they’re not neatly arranged in the cell. But they’re actually sort of randomly located around the cell.

Figure 1.49: Another representation of chromosomes. The lines are wiggly to emphasize that they are randomly located around the cell.

I may draw them as I’ve done in Figure 1.50 where I’m showing all of the chromosomes from a particular person in the same color. Of course, I haven’t drawn all 23. I’ve only drawn a small number because that’s all I need to make the point. The chromosomes from one person are in purple and from the other person in blue. But you’ll notice that the corresponding chromosomes are the same length in the two people. They just differ in their colors. In this case, I’m using color to indicate the source of the chromosomes, not the genetic information on the chromosomes. Two corresponding chromosomes in different colours will have different versions of the same genes, different alleles of the same loci. On the other hand, two different chromosomes in the same colour will have completely different genes.

Figure 1.50: Five corresponding chromosomes from two people. The chromosomes in blue are from one person, the chromosomes in purple from another person. Corresponding chromosomes have the same length.

In Figure 1.51 is another representation of chromosomes in a single person. I’m only drawing one chromosome, using different colors again to indicate that the two versions came from two different people. The light blue version came from his mum and the dark blue version from his dad.

Figure 1.51: The two versions of one chromosome in a person: one from his mum and one from his dad.

Summary

We’ve talked about the physical structure and informational content of chromosomes. We’ve talked about the key regulatory signals that control how chromosomes function, that allow DNA to act as chromosomes: a chromosome has to have origins, telomeres, and centromeres. We’ve introduced other new terminology for dealing with the issues of having different versions of genes. We’ve also talked about how chromosomes can be represented.

Lecture 1.J Genes on Chromosomes

Outline

We’re going to continue our discussion of chromosomes. We will think about how the genes are arranged on the chromosomes. In Section 1.J.1 we’re going to dive into a chromosome, and zoom in until we can resolve a single gene, and get a sense of what the gene looks like at a chromosome scale and how much of the gene codes for protein. And then, in Section 1.J.2 we’ll zoom in on a different gene to see how genes are arranged on the chromosomes.

1.J.1 Dive into a chromosome to resolve a single gene

Just to refresh and affirm this very important point from the previous lecture: each of the 23 chromosome types in us, or of the different chromosomes in any organism, has different genes, not just different versions of genes. The chromosomes in Figure 1.52 are coloured to represent that. We have two versions of the light green chromosome number 1, and two versions of the blue chromosome 2, and two versions of each of these chromosomes in our cells. And in the whole population there are many versions of these chromosomes.

Figure 1.52: A representation of the human chromosomes. Each one of them has different genes.

In Figure 1.53 is a particular chromosome. It is human chromosome 13. You see the landmarks that I pointed out earlier, the centromere, and the special end sequences, the telomeres, in purple and white.

We’re going to focus in on a particular gene, the BRCA2 gene. This is one of the genes that’s been studied intensively because some versions of this gene cause a greatly elevated risk of breast and ovarian cancer. Gene BRCA2 is located on chromosome 13, and you can see it’s at about 32 million base pairs along the about 114 million base pairs of chromosome 13. Million base pairs is usually abbreviated as Mb (megabase). We zoom in on the part between 30.8 Mb and 32.8 Mb: we expand this bit to look at just this segment of about 2 Mb of the chromosome, and the BRCA2 gene occupies the yellow space.

Figure 1.53: Human chromosome 13, which contains the BRCA2 gene. The top picture gives the whole chromosome. In the bottom picture, we zoom in on the part where BRCA2 is located.

When we zoom in even more, we see the BRCA2 gene in Figure 1.54, and the yellow segment at the bottom of the figure is the whole gene. The gene itself, we see, is 90 kilobases, 90,000 base pairs long. We see a schematic drawing of the gene: the black rectangles and vertical lines represent the exons. And you’ll remember from Section 1.G.1 that these are the parts of the gene that code for protein. There are about 25 exons in this gene. But most of the gene isn’t exons at all. Most of the gene is introns. All of the spaces between the exons are intron sequences, which are cut out and discarded when the mature messenger RNA is synthesized.

Figure 1.54: Zooming in on the BRCA2 gene in chromosome 13. The yellow segment at the bottom is the whole gene.

1.J.2 How genes are arranged

In Figures 1.53 and 1.54 we saw what one gene looks like. But how are genes arranged on the chromosomes? Well, not the way I would arrange them if I was tidying things up.

In Figure 1.55 is a different chromosome. This is chromosome 20. It’s about 62 megabases long and we’re blowing up a segment of it that’s about 500 kilobases long.

There are 10 genes in this 500,000 base pairs of DNA sequence, indicated by the blue arrows. One thing I want to point out is how the promoter tells RNA polymerase which strand to use, which determines which direction it goes on the chromosome. Well, here’s an example of this. In Figure 1.55, the direction of the arrows gives the direction of transcription. Most of these genes are transcribed from the bottom strand and they’re going from left to right. But two of the genes are transcribed in the opposite direction, which is to say they use the information on the other strand.

The other important thing you’ll notice is that there are a lot of spaces in between the genes. Both the order, the direction in which the genes are transcribed, and the spacing of the genes seem very random. It’s not neatly organized at all. It doesn’t actually need to be neatly organized to work. That’s why it’s not neatly organized.

Figure 1.55: Human chromosome 20. In the bottom picture we zoom in on a segment of 500 kilobases long, which contains 10 genes, represented by the blue arrows.
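To make the point about genes on opposite strands concrete: a gene on the other strand is read in the opposite direction along the chromosome, so in its own reading direction its sequence is the reverse complement of what is written on the first strand. The short Python sketch below shows this for a made-up stretch of DNA; the sequence and the function name are illustrative only.

```python
# Base-pairing rules: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the other strand, written in its own reading direction."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical stretch of one strand, read left to right.
top_strand = "ATGGCGTAC"
other_strand = reverse_complement(top_strand)

print(top_strand)    # as read on the first strand
print(other_strand)  # what a gene on the opposite strand would use
```

Applying the function twice returns the original sequence, which is one quick way to check that the two strands carry complementary copies of the same stretch of DNA.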

In Figure 1.56 is again a single gene, an open reading frame taken out of chromosome 20 in Figure 1.55, where it is seen in the middle. It’s a gene whose function isn’t known. It doesn’t have a specific name describing its function. It just has a number: C20orf70, which stands for open reading frame number 70 on chromosome 20. And again, in this representation we see the exons represented as boxes and the introns represented as lines joining the boxes. Only the box parts code for protein. Only they will be assembled into the final messenger RNA that’s translated into protein.

Figure 1.56: This is a single gene taken from the middle of chromosome 20 (see Figure 1.55). Its function is not known and hence it does not have a specific name. It is known as open reading frame number 70 on chromosome 20 (C20orf70). The exons are represented as boxes and the introns as lines joining the boxes.

Here’s a question for you to give you a sense of the scale in the chromosome. Chromosome 20 is about 60 million bases long and it’s got about 900 genes. Suppose C20orf70 is a typical gene. The question is: if all the genes are like this one, how much of chromosome 20’s DNA codes for protein?

The right answer is 3%. This comes from a combination of arithmetic and estimation. What you first have to do is make a rough approximation of how much of this gene is actually coding for protein. As just a ballpark estimate, I’d say about 10% of the line is boxes and the rest is introns. How long is the gene? Well, the gene’s probably about 20 kb. That means that there’s approximately 2 kb coding for protein. There are about 900 genes on the chromosome and if they’re all like this one, as we’re assuming, that means about 1,800 kb, 1.8 million base pairs, on this chromosome code for protein. But the chromosome itself is 60 million base pairs, which is 60,000 kb. That means that about 3% codes for protein.
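The estimate above can be written out step by step. This is only a sketch of the same ballpark arithmetic in Python, using the rough figures from the text (60 million bases, 900 genes, 20 kb per gene, 10% of each gene in exons); the variable names are mine.

```python
# Rough figures taken from the estimate in the text.
chromosome_bp = 60_000_000        # chromosome 20, about 60 million bases
n_genes = 900                     # about 900 genes on the chromosome
gene_length_bp = 20_000           # a "typical" gene, about 20 kb
exon_fraction_per_gene = 0.10     # about 10% of the gene is exons

coding_bp_per_gene = gene_length_bp * exon_fraction_per_gene  # about 2 kb
total_coding_bp = n_genes * coding_bp_per_gene                # about 1.8 million bp
coding_share = total_coding_bp / chromosome_bp                # fraction of the chromosome

print(f"coding DNA per gene: {coding_bp_per_gene:,.0f} bp")
print(f"coding DNA on the chromosome: {total_coding_bp:,.0f} bp")
print(f"share of the chromosome coding for protein: {coding_share:.0%}")
```

Changing any one of the four inputs shows how sensitive the 3% figure is to the ballpark guesses, especially the 10% exon fraction read off the figure by eye.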

Summary

We’ve dived into the structure of a single gene on a chromosome and looked at how genes are arranged:

• we’ve seen that the gene is mostly introns,

• we’ve looked at the arrangement of genes over a large scale on the chromosome, and seen that most of the chromosome is actually unused space.

But despite mostly being introns and unused space, there are still millions of base pairs coding for protein on each of our chromosomes.

Lecture 1.K DNA Sequencing

Outline

This is going to be a very short overview of the marvels of modern DNA sequencing. We’ll talk in Section 1.K.1 about how the technology differs from what we used to be able to do, and the different things that we can learn using this technology in Section 1.K.2.

1.K.1 Modern DNA Sequencing

Back in the good old days, before about the year 2000, DNA sequencing was done on pure samples of DNA consisting of many identical copies of a single DNA fragment. The result of that analysis was one DNA sequence that represented the average sequence over the whole sample. This is shown in Figure 1.57.

Figure 1.57: Old-style DNA sequencing gives one sequence by combining many identical DNA molecules.

Modern DNA sequencing, in contrast, can take a sample that consists of a mixture of many different DNA molecules and it can sequence each molecule in the sample, giving many, many different sequences as what are called “sequence reads”. This is represented in Figure 1.58.

Figure 1.58: Modern DNA sequencing gives many individual sequences, one from each molecule in the sample.

Modern DNA sequencing is also an awful lot cheaper than it used to be. In Figure 1.59 is a graph showing the cost of DNA sequencing as it’s changed between the year 2001 and the year 2013. It’s compared with Moore’s Law for the decrease in the cost of computer power, which is usually stated as the cost of computing power falling by half about every two years. For the first six or seven years the cost of DNA sequencing decreased dramatically, roughly at the same rate as the cost of computing power. And then starting in about 2007 it plummeted, and now it’s back to still decreasing dramatically but only at about the rate of Moore’s Law. Do you see this becoming very cheap? Well, it’s cheap for science. Nothing in science is really cheap.
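To get a feel for what a Moore’s Law rate of decrease means over the period in the graph, here is a small Python sketch. It assumes the commonly quoted halving time of about two years and uses a starting cost of 1.0 in arbitrary units, so the numbers are relative and are not actual sequencing prices.

```python
def cost_under_halving(start_cost: float, years_elapsed: float,
                       halving_time_years: float = 2.0) -> float:
    """Cost after steady halving; the two-year halving time is an assumption."""
    return start_cost * 0.5 ** (years_elapsed / halving_time_years)


# Relative cost in 2013 if a cost of 1.0 in 2001 simply followed this rule.
relative_cost_2013 = cost_under_halving(1.0, 2013 - 2001)

print(f"relative cost after 12 years: {relative_cost_2013:.4f}")
print(f"that is a {1 / relative_cost_2013:.0f}-fold decrease")
```

Twelve years at this rate is six halvings, a 64-fold drop. The point of the graph is that sequencing costs fell much further than that once the newer methods arrived.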

Figure 1.59: The cost of DNA sequencing from the year 2001 till 2013, compared with Moore’s Law.

1.K.2 Applications

I’m going to describe three applications.

Sequencing a genome

The first application is sequencing not just a gene but a whole genome. For an organism with a relatively small genome, we can take a single sample tube with the entire genome broken into fragments and sequence all the fragments. And then by comparing the fragments we can line them up and find the overlaps that allow us to infer the complete genome sequence from the assembly of these short reads. For a larger genome we can do the same thing, but we’ll need to use multiple samples.
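The idea of lining up fragments by their overlaps can be sketched in a few lines of Python. This is a toy greedy approach on three made-up reads, with a minimum overlap of three bases chosen arbitrarily; real assembly programs use far more sophisticated methods and must cope with sequencing errors and repeated sequences.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str]) -> str:
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # nothing overlaps any more; the pieces stay separate
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return max(reads, key=len)


# Hypothetical short reads taken from one made-up 15-base "genome".
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
assembled = greedy_assemble(reads)
print(assembled)
```

Each merge joins two reads where the end of one matches the start of the other, so the three eight-base reads collapse into a single longer sequence.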

Sequencing DNA from a population (one species) or a community (a mixture of species)

A second application is that rather than sequencing one genome we can sequence a mixture of DNAs from different sources: DNAs from all members of the population or from all of the species in a community.

For instance, you can scrape the bacteria off your teeth and extract the DNA from this mixture of bacteria and sequence it. This kind of study is called metagenomics, and here it is used to study what is called the human microbiome, the ecological communities of bacteria that live on us. When these sequences are analysed, they can be sorted into groups that tell you the kinds of organisms that are present in this sample, even though you don’t have any way to directly study these organisms.

Measuring RNA abundances in a cell or tissue

The third application of DNA sequencing is measuring RNA abundances in a cell or tissue.

We’ve known for many years that different genes are expressed at different levels, and there are different amounts of RNA for different genes in the cell. But we didn’t have an easy way to measure this. We still don’t have an easy way to sequence RNA directly, but we do have a very easy way to use an enzyme to make DNA copies of the RNA, and then we use our very efficient DNA sequencing method to sequence all of the “complementary DNA”, as it’s called, in the sample, which gives us a measure of the sequences of the RNAs in the original sample. We can then basically count the amount of RNA sequence that we get for each gene in the genome, and that tells us how much RNA is present in the cell or tissue from each gene.

In Figure 1.60 we see the results of such an analysis. Gene B is expressed at a very high level, while gene D is hardly expressed at all.

Figure 1.60: Amount of RNA present from each gene
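The counting step described above is simple once each read has been matched to the gene it came from. The Python sketch below assumes that matching has already been done and uses an invented list of assignments; the gene names and counts are made up and are not the data behind Figure 1.60.

```python
from collections import Counter

# Hypothetical result of matching each complementary-DNA read to a gene.
read_assignments = [
    "gene B", "gene A", "gene B", "gene C",
    "gene B", "gene B", "gene A", "gene D",
]

# Count how many reads came from each gene.
counts = Counter(read_assignments)
total_reads = sum(counts.values())

# Report genes from most to least abundant.
for gene, n_reads in counts.most_common():
    print(f"{gene}: {n_reads} reads ({n_reads / total_reads:.0%} of the sample)")
```

More reads from a gene means more RNA from that gene was present in the sample. A real analysis would also take gene length and the total number of reads into account before comparing genes.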

Summary

We’ve talked about how the new methods of DNA sequencing can sequence many DNAs at once, rather than just a single molecule, and they can do it far less expensively than the old methods did. This enables very efficient DNA sequencing to be applied to new problems. We can sequence whole genomes much more easily, for instance, than we sequenced the human genome when we started. We can characterize the genetic membership of ecological communities just by sequencing the DNA without actually knowing what the organisms are. And we can use DNA sequencing to measure the abundance of RNA in a sample, which tells us how strongly different genes are expressed in different tissues.

Lecture 1.L Homology in biology and in genes

Outline

We’re going to talk about the very important concept of homology. We’ll talk about its meaning in biology (Section 1.L.1), and we’ll particularly apply it to genetics. We’ll talk in Section 1.L.2 about how to decide if similarities are due to homology, or due to other factors, and in Section 1.L.3 we’ll talk about the specific genetic case of homologous chromosomes.

1.L.1 Homology

I’ll start with a little justification for why we talk about evolution a lot in a genetics course. Here is a quote from a noted geneticist, Theodosius Dobzhansky:

Nothing in biology makes sense, except in the light of evolution.

This is particularly true for genetics. Heritable genetic variation is what genetics is about, but it’s also what evolution is about, since heritable genetic variation is what makes evolution possible. It allows natural selection. It’s what natural selection acts on, and it’s what has evolved. All of the features of the genetic systems that we study are the products of evolution. So we’re very much embedded within an evolutionary world, as is all biology.

Homology is often confused with similarity. That’s because homology is a special kind of similarity. Lots of things are similar. Things that are homologous are similar because of shared ancestry.

Homology: Similarity due to shared ancestry.

Figure 1.61: homologous limb bones in vertebrate species

In Figure 1.61 are examples of the limb bones of four different vertebrate species, starting with the human arm, followed by dog, bird and whale limbs. The bones have been coloured to show the similarities between the four species. There’s the upper arm bone (light brown), which is a big bone, then the two bones in the lower arm, or forearm, coloured red and white, respectively. Then there are the small bones of the wrist (yellow), and the long bones of our fingers (brown). We see similar arrangements of bones in the limbs of the other vertebrates. This is because the ancestor of all these vertebrates had a limb with this structure. It is called the tetrapod limb. It is shared ancestry that has given us these similar features.

Shared ancestry also applies to the genes that are responsible for the development of these features. For instance, in Figure 1.62 we’re comparing the genes that control embryonic development in a fruit fly, and in a mouse.

Figure 1.62: homologous genes in the fruit fly (Drosophila) and mouse

The genes are coloured according to their function and positioned according to their arrangement on the chromosome. We see that the genes are similarly coloured for similar parts of the embryo, and they’re arranged similarly on the chromosomes. They also have similar sequences, although you don’t see that here. We now know that genes like this control embryonic development in all animals, and that these genes are similar because they are descended from a common ancestor of all animals, which controlled its development with these genes.

We can extend this comparison farther, to look at individual sequences. Here are amino acid sequences encoded by two genes, the human Aniridia gene and the eyeless gene of the fruit fly:

Gene Amino acid sequence (single-letter abbreviations)

Aniridia LQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREE

eyeless LQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREE

When the eyeless gene of fruit flies is defective, the fly doesn’t have any eyes. The Aniridia gene was studied completely independently in humans. It’s responsible for a disease called aniridia, which is the hereditary absence of the iris of the eye. Once these genes were sequenced and compared, it was astonishing to see how similar these sequences are. The red amino acids are the ones that are different: only 6 of the 60 positions shown. Everything else about these sequences is identical. And this level of similarity must be due to common ancestry. This tells us that the ability to develop an eye, although the types of eyes are very different in fruit flies and humans, is an ancestral feature that was controlled by this gene in the common ancestor of fruit flies and humans.

Figure 1.63: The Australian echidna (left) and European hedgehog (right) look similar because they both have spines, but this is because of convergent evolution, not homology.
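The comparison of the Aniridia and eyeless sequences listed above can be checked position by position. The Python sketch below copies the two 60-letter sequences from the table and counts the positions where they differ; the variable names are mine, and positions are numbered from 1.

```python
# The two amino acid sequences from the table (single-letter abbreviations).
aniridia = "LQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREE"
eyeless = "LQRNRTSFTNDQIDSLEKEFERTHYPDVFARERLAGKIGLPEARIQVWFSNRRAKWRREE"

# A position-by-position comparison only makes sense for equal lengths.
assert len(aniridia) == len(eyeless)

# Collect (position, human amino acid, fly amino acid) wherever they differ.
differences = [
    (position, human_aa, fly_aa)
    for position, (human_aa, fly_aa) in enumerate(zip(aniridia, eyeless), start=1)
    if human_aa != fly_aa
]

n_identical = len(aniridia) - len(differences)
identity = n_identical / len(aniridia)

for position, human_aa, fly_aa in differences:
    print(f"position {position}: {human_aa} in human, {fly_aa} in fly")
print(f"{n_identical} of {len(aniridia)} positions identical ({identity:.0%})")
```

Six differences in 60 positions is 90% identity between a human and a fruit fly, which is the kind of similarity that is far too strong to explain by chance.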

1.L.2 How to decide if similarities are due to homology

Not all similarity is due to homology. That’s part of why there’s this confusion. Possible causes of similarity are the following.

1. Chance: Similarity can simply be due to chance. This is especially true for relatively simple features. For example, if we’re comparing just short segments of two DNA sequences, we’ll often find short strings of bases that are the same, simply by chance, with no evolution or ancestry involved.

2. Convergent evolution: Similarity can also arise in independent features that become similar because natural selection is selecting for the same function. In Figure 1.63 we see two animals. At the left hand side is the Australian echidna. Its only close relative is the platypus. It lays eggs. At the right is the hedgehog, a European mammal. Both of these animals eat insects, and both of them have transformed their fur coats into spines. We know that this is not homologous, because we know a great deal about the ancestors of the echidnas, and the ancestors of hedgehogs, and they didn’t have spines. They have independently been selected for spiny coats, as a form of defence.

3. Homology: Similarity can be due to common ancestry, the kind of similarity that I described in Section 1.L.1.
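To see why chance can explain short matches but not long ones, here is a rough Python calculation. It assumes the simplest possible model, in which the four bases are equally common and every position is independent, which real DNA only approximates; the function name is mine.

```python
def chance_match_probability(length: int, alphabet_size: int = 4) -> float:
    """Probability that two random stretches of this length are identical,
    assuming equally common letters and independent positions."""
    return (1 / alphabet_size) ** length


# Compare a short stretch of bases with longer ones.
for length in (5, 10, 20):
    p = chance_match_probability(length)
    print(f"{length} bases: about 1 chance in {1 / p:,.0f}")
```

Under this model a 5-base match is a 1-in-1,024 event, which happens often when long sequences are compared at many positions, while a 20-base match is about a 1-in-a-trillion event.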

How do we decide whether similarity is really due to shared ancestry? The answer is, it’s not always obvious. But usually we have a lot of other information. The general principles are as follows. If the similarities are

1. so strong that they couldn’t have arisen by chance, for instance, the similarity between the Aniridia gene and the eyeless gene that was shown in Section 1.L.1,

2. too arbitrary to have arisen by convergent evolution, that is, the similarities extend to features that would not be acted on by natural selection,

then we infer that these similarities must have arisen because of divergence from a common ancestor, and that they must be due to homology.

Once we’ve decided that features are homologous, we can use them to tell us more about evolutionary processes. This is particularly true for DNA sequences. Once we’ve decided that DNA or protein sequences are homologous, there’s a lot of other information in there that lets us make evolutionary inferences. The simplest is: knowing how similar two homologous sequences are tells us how recent their common ancestor was.

Figure 1.64: Homologous sequences from three species. The sequences of species 1 and species 2 differ only at two positions, while the sequences of species 2 and species 3 differ at eight positions. Species 2 has a more recent common ancestor with species 1 than with species 3.

In Figure 1.64 are homologous sequences from three different species, and we see that the first two species differ only at two positions. Therefore, we infer they had a very recent common ancestor. However, sequence two and sequence three are much more different. And we infer that the common ancestor of sequence two and sequence three is more distant than the common ancestor of sequence two and sequence one. We’ll talk more about this in Module 2.
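The reasoning from differences to relatedness can be illustrated by counting mismatches between aligned sequences. The three 20-base sequences in this Python sketch are invented so that they differ at two and at eight positions, matching the counts described for Figure 1.64; they are not the sequences in the figure.

```python
def count_differences(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical homologous sequences from three species.
species_1 = "ATGGCGTACGTTAGCCTAGA"
species_2 = "ATGACGTACGTTCGCCTAGA"
species_3 = "CTTACATGCCTTCGTCAATA"

diff_1_2 = count_differences(species_1, species_2)
diff_2_3 = count_differences(species_2, species_3)

print(f"species 1 vs species 2: {diff_1_2} differences")
print(f"species 2 vs species 3: {diff_2_3} differences")

# Fewer differences suggests a more recent common ancestor.
closer_relative = "species 1" if diff_1_2 < diff_2_3 else "species 3"
print(f"species 2 shares a more recent ancestor with {closer_relative}")
```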

1.L.3 Homologous chromosomes

In genetics, the word homology is used generally, but one particular use is in the term homologous chromosomes. It’s really the same meaning that homologous has everywhere else: similarity due to shared ancestry. In particular, chromosomes that carry different versions of the same information are called homologous chromosomes.

It also applies to the two versions that we have of each of our chromosomes. So, our chromosome 7 from mum, and our chromosome 7 from dad (illustrated in Figure 1.65), are referred to as homologous chromosomes. They are of course truly homologous, because the reason they’re so similar is that they’ve descended from very recent common ancestors, not species way back in evolutionary time, but humans who were our ancestors, maybe only a few thousand years ago. They’re very similar, because they share very recent common ancestry. We refer to the different versions of any particular chromosome that humans have as being homologous chromosomes.

Figure 1.65: Our own homologous chromosomes 7. Chromosome 7 from mum and chromosome 7 from dad are very similar: the same genes are on each, in the same order; the DNA sequences differ by only 0.1%. The reason is that they have descended from a very recent common ancestor.

Summary

We’ve talked about homology, which is, in some ways, a very difficult concept. But definition-wise, it’s very simple: homology is similarity that exists because of shared common ancestry. The trickiness can come in deciding where it applies, but you will only be encountering situations where it’s really quite obvious that it applies.

We also talked about the specific case in genetics of the different versions of a single chromosome that we have, which are referred to as homologous chromosomes.

Lecture 1.M Life Cycles

Outline

We’re going to talk about life cycles. First, in Section 1.M.1, we’ll talk about the simple cell cycle and then, in Section 1.M.2 we’ll talk about plant and animal life cycles, both asexual cycles and sexual cycles. This will serve as sort of a preface to prepare us to talk about genetic variation in life cycles, which we’ll do in the next lecture.

1.M.1 The cell cycle

In Figure 1.66 you can see growth and reproduction of a cell. Basically, the cell starts out with one copy of each chromosome. It grows, and when it reaches a sufficient size to reproduce, it replicates its DNA. Then it has two copies of each chromosome. The cell grows a bit more, and then divides. This asexual division is called mitosis. You will learn a lot more about mitosis in Module 7. This division produces two identical daughter cells, just like the original cell. They’ve got the same three chromosomes. These cells can then grow and divide, and depending on the cell type, this can go on indefinitely.

Figure 1.66: Growth and reproduction of a cell.

Figure 1.67: A compact representation of the cell cycle.

This is often represented as a cycle as shown in Figure 1.67, which is a convenient representation because it’s compact, it doesn’t take up much space. But of course, we have to take into account that in reality, this is a process that’s occurring through time with the same events occurring again and again, producing more and more cells, as in Figure 1.68.

Figure 1.68: A linear representation of the cell cycle where it is more clear that more and more cells are produced.

1.M.2 Typical plant/animal life cycles

The reproduction of asexual organisms is built on the same cell division process as discussed in Section 1.M.1. We start with the simplest starting place: unicellular organisms, organisms where when the cells divide, they come apart, as shown in Figure 1.69. This applies to most bacteria and to many different kinds of eukaryotes, most algae, many protists. It is exactly the same process I showed before in Figure 1.66, but the emphasis is on the fact that we are gradually generating more and more progeny cells.

Figure 1.69: The reproduction of a unicellular organism.
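How quickly "more and more" cells accumulate can be seen with a one-line calculation. This Python sketch assumes the idealized case in which every cell divides in two at every round and no cells die, which real populations only approximate; the function name is mine.

```python
def cells_after(divisions: int, starting_cells: int = 1) -> int:
    """Cell count after a number of rounds in which every cell divides in two."""
    return starting_cells * 2 ** divisions


# Each round of division doubles the population.
for rounds in (1, 2, 3, 10, 20):
    print(f"after {rounds} rounds of division: {cells_after(rounds):,} cells")
```

Ten rounds of division turn one cell into about a thousand, and twenty rounds into about a million, which is why the cycle diagram hides so much of what is happening through time.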

Essentially the same process happens with multicellular organisms. Theonly difference is that the progeny cells, the daughter cells, stay together anddifferentiate into specialized tissues to make the mature multicellular organism.

Many plants can reproduce asexually by simple cell division. Not manyanimals can. But the process is very simple. Basically another simple celldivision, mitosis, produces an asexual seed which is genetically identical to theparent plant. The seed then grows into the next generation of plant which isgenetically identical to the original plant. This is shown at the right side inFigure 1.70.

Some plants can reproduce, instead of by seeds, by sending out runners, often called stolons, the botanical name (see the left part of Figure 1.70). Because these runners are also generated by simple cell division and are therefore genetically identical to the parent plant, they grow into progeny plants that are identical to the parent.

This process doesn’t apply in humans because we are obligately sexual. We can only reproduce sexually. If we don’t have sex, we won’t have any generations at all.

Figure 1.70: Two forms of asexual reproduction in multicellular organisms: by a stolon (left) and by an asexual seed (right).

What about sexually reproducing organisms? Well, this applies both to plants and to animals and to many single-celled organisms as well. The basic process is that there are contributions from two parents. One parent contributes one complete set of genes in either a pollen cell (Figure 1.71) or a sperm cell (Figure 1.72), and the other parent contributes an egg cell. The two fuse together to produce the fertilized egg that grows into the next generation. In plants, the fertilized egg develops into the seed which grows into the next generation of plants; in animals, the fertilized eggs grow into the next generation of animals.

The most important difference between these two processes (asexual and sexual reproduction) is that in sexual reproduction, the gamete cells, the pollen or sperm and the eggs, are produced by a special cell division called meiosis. You’ll learn a great deal about meiosis in Module 7. Because of meiosis, the fertilized eggs are not genetically identical to the parents. That means that the offspring (the plants or the bunnies) can have different combinations of the genetic properties of their parents: different leaf patterns, different flower colors, different coat colors, for example, and many other things.

Figure 1.71: Sexual reproduction of plants: the pollen cell from one parent and the egg of the other parent each contribute a complete set of genes. The fertilized egg grows into a seed for the next generation.


Figure 1.72: Sexual reproduction of animals: the sperm cell from the father and the egg of the mother each contribute a complete set of genes. The fertilized egg grows into an animal of the next generation.

Lecture 1.N Ploidy and recombination

Outline

We’re going to talk about ploidy and recombination. Building on the basic understanding of life cycles from the previous lecture, we’re going to talk in Section 1.N.1 about how ploidy changes in sexual reproduction. We alternate between haploid and diploid cells. And in Section 1.N.2 we’ll talk about how recombination, which is part of the process of producing haploid gametes and bringing them back together to diploids, creates new versions of genomes in two ways: by processes called reassortment and crossing over.

1.N.1 Ploidy: sexual reproduction alternates haploid and diploid cells

First I give you some essential terms. These are terms that describe the genetic constitution of cells in terms of their chromosome sets.

haploid: one complete set of chromosomes

diploid: two homologous complete sets of chromosomes

N (or n): the number of chromosomes in a set

A haploid cell has one complete set of chromosomes. In Figure 1.73, I’ve drawn a set of five chromosomes as a complete set. Although the chromosomes all have the same colour, they are distinguished by having different lengths.

A diploid cell has two homologous complete sets. In Figure 1.74 I’ve drawn two sets. You can see that each set has the same number of chromosomes and homologous chromosomes have the same lengths. And they’re different shades of the same color to help us remember that they’re just different versions of the same chromosomes.


Figure 1.73: A haploid cell has one complete set of chromosomes.

Figure 1.74: A diploid cell has two homologous complete sets of chromosomes.

A third term that’s very useful is N, sometimes lowercase n, which denotes the number of chromosomes in a set. So the haploid cell in Figure 1.73 has N = 5. The diploid cell in Figure 1.74 also has N = 5, the number of chromosomes in one complete set. Alternatively, we could write 2N = 10 to describe the diploid cell. For humans, N = 23. And we are diploid, so normally 2N = 46 chromosomes.

One point which is important to make clear is that the concept of ploidy is not used to distinguish between the amounts of DNA in the cell before and after DNA replication:

Ploidy does not change during the cell cycle.

When we look at the cell in Figure 1.66, we can say that this cell must be haploid. We can decide that because I’ve drawn it with only three chromosomes (it’s an odd number, so it couldn’t be diploid) and because the three chromosomes that I’ve drawn are all different lengths. The cell replicates its DNA and ends up with six chromosomes, but it’s still a haploid cell because the chromosomes are not homologous chromosomes, they are only identical copies. When the cell divides, the daughter cells are still haploid.

1.N.2 Recombination creates new versions of genomes, by reassortment and crossing over

There are very important genetic consequences of sexual reproduction. Not only does sexual reproduction alternate us between haploid gametes and diploid cells for the rest of our bodies, but the process of generating the haploid gametes, and then fusing them together to get new diploid cells, generates new genetic combinations from the previous generation.


These combinations arise in two ways:

1. reassortment: new combinations of chromosome versions (always in full sets),

2. crossing over: new combinations within each chromosome.

Crossing over produces versions of a chromosome that are different from the versions that were present in either parent. All of this will be discussed in great detail in Module 7.

We will now consider the inheritance of sets of chromosomes coming from your mother and your father as illustrated in Figure 1.75. Again, we’re pretending there are only five chromosomes for simplicity. Your mum has two sets of chromosomes that she inherited from her parents, a set from her mum and a set from her dad. I’ve drawn them as if they were kept separate, but in fact they’re mixed together in the cell. When the egg meets the sperm, the chromosomes mix together, and the cell doesn’t keep track of which parent they came from. Then, on the right in Figure 1.75 there are the two sets from your dad, one from his mother and one from his father, and again, these two sets will have been mixed together in his cells.

When your parents produce gametes, the eggs and the sperm, they do so by the cell division called meiosis. Meiosis takes the two sets of chromosomes in the diploid cell and generates new cells that have only a single set of chromosomes. But because the two sets of chromosomes were all mixed up, the new set that you get is usually a mixture of chromosomes from each of your maternal and paternal grandparents. It’s a complete set, but it’s a new complete set with different combinations than in either parent. And then, when you go on to reproduce, your gametes contain new combinations of the two sets that you inherited, because again the two sets (one from your mother and one from your father) are mixed together in your cells. Your body doesn’t keep track of the source. Your children inherit new combinations of the chromosomes that you got from all four of your grandparents. And then they’ll get another complete set, different combinations of the chromosomes that your partner got from his/her grandparents. How they came about can be seen in Figure 1.76.

Figure 1.75: The inheritance of sets of chromosomes coming from your mother and your father.

This is all part of reassortment, often called shuffling because it’s like shuffling the cards in a deck. But there’s a second process called crossing over that creates even more genetic variation between the generations.

In Figure 1.77 and Figure 1.78 we’re going to consider just a single chromosome. The two copies of it in your mother and the two copies in your father are given at the top of Figure 1.77.

When meiosis produces the gametes from your parents, each gamete contains only one copy of each chromosome. But it’s not simply one or the other of the parent’s two copies. It’s a copy that’s a new combination of segments from those two copies. What’s happened, literally, is that there have been breaks and joins, so that the chromosome that you inherit from your mum consists of pieces of chromosome from your maternal grandmother and from your maternal grandfather. The same is true for the chromosome that you inherit from your father. It’s a new combination of the information, the different versions he got from his two parents. And again, when you have children, your gametes, and consequently your children, will contain new combinations produced by more crossing over.

In the same way the chromosomes that your children inherit from your partner will also be new combinations of segments of the homologous chromosomes that your partner inherited from his or her grandparents. This is illustrated in Figure 1.78.

So you can see, even in following from your parents to your children, there’s an enormous amount of new genetic diversity created. Very many new combinations of the genetic variation that was present in your grandparents are present in your children.

Figure 1.76: The inheritance of sets of chromosomes coming from your partner’s mother and your partner’s father.

Figure 1.77: Crossing over between the two copies of a chromosome leads to even more genetic variation between generations. The picture shows how your children get parts of chromosomes from your parents.

Figure 1.78: Crossing over between two copies of a chromosome in the last two meiosis steps that lead to the chromosome that your children receive from your partner.

Lecture 1.O Genetic variation in populations

Outline

We’re finally getting back to talking about genetic variation in populations. In Section 1.O.1 we’ll talk about alleles and genotypes in individuals and populations. We’ll consider kinds of DNA sequence differences in Section 1.O.2, because most of the analysis is built on looking at sequence differences. We’ll talk about how we compare DNA sequences in Section 1.O.3 and we’ll think about allele frequencies and genotype frequencies, specifically in populations, in Section 1.O.4. In Section 1.O.5 I’ll introduce some new terms and a new concept: the concept of a haplotype, which is a kind of genotype.

1.O.1 Alleles in individuals and populations

We’ll start by considering a position somewhere in a DNA sequence. The position is shown in yellow in Figure 1.79, where I’m showing both strands with bars indicating the base pairing.

Figure 1.79: A position somewhere in a DNA sequence. Both strands of DNA are shown with bars between them that indicate the base pairing.

But we don’t need to see the base pairing, because we know the conventions: when we see a single sequence of a single strand of DNA, as in Figure 1.80, the left end is the 5’ end and the complementary strand has the complementary sequence. We don’t need to show any of this.

Let’s now think about genetic variation.

• How many different alleles are possible at this one position? Four alleles are possible because there are four bases. We could have an A or a G or a C or a T.

• How many different alleles could one person have? Well, a single person could only have two alleles because they only have two versions of the sequence.

• How many different genotypes are possible for one person? If we’re just considering two alleles, e.g. the alleles A and G, then there are three different genotypes possible for a person at this position: AA, AG, GG.

I’m going to introduce two new terms which we’ll be using quite a lot in the future. Considering two alleles, the person could have two copies of the identical allele. They could have two As, one from mum, one from dad, or two Gs, in which case we would say they were homozygous at this position, this locus. Either homozygous for A or homozygous for G. Or, they could have two different alleles (one A and one G), in which case we would say that they were heterozygous at this position.

Figure 1.80: A single-strand representation of the same part of the DNA molecule as in Figure 1.79.

Let’s branch out and think about the same position, but this time we’re thinking of a population of people. How many different alleles are possible? Well, again, there are only four bases, so there are only four alleles possible. How many different alleles could the population have? When we had a person, they could only have two, but the population could certainly have all four alleles. How many different genotypes would then be possible in the population? Well, with four alleles, we can make a lot of combinations. In fact, there are 10 different genotypes that would be possible for this location just from the four different bases in a population: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
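The counts of three genotypes for two alleles and ten genotypes for four alleles can be checked by listing every unordered pair of alleles. The following is a minimal Python sketch; the function name `unordered_genotypes` is our own choice for illustration and does not come from the lecture.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"  # the four possible alleles at a single DNA position


def unordered_genotypes(alleles):
    """List every diploid genotype (unordered pair of alleles) as a string."""
    pairs = combinations_with_replacement(sorted(alleles), 2)
    return ["".join(pair) for pair in pairs]


# One person, considering only the alleles A and G.
person = unordered_genotypes("AG")

# A whole population in which all four bases occur at this position.
population = unordered_genotypes(BASES)

print(person)
print(len(population), population)
```

Because AG and GA are the same genotype, the sketch counts pairs without regard to order, which is why four alleles give ten genotypes rather than sixteen.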

Now I’ve got a question for you to answer. How many different alleles of a gene can a population have?

The right answer is: it depends on how big the gene is and it depends on how big the population is. Because, for a big gene, there are potentially four different alleles for every base in that gene. If the gene is 1,000 base pairs long, that’s many thousands of alleles. If the population’s very small, the population can only have two alleles per person. The number of alleles can’t be larger than twice the size of the population.

1.O.2 Kinds of DNA sequence differences

Let’s think about kinds of DNA sequence differences. We already introduced the concept of alleles, different versions or variants of a gene, but now we’re going to think about alleles in the context of DNA sequences. And a term that we’ll commonly use is a single-nucleotide variant, often abbreviated as an SNV. That just means the presence in a population of different nucleotides at homologous positions in two DNA sequences.

We have to compare homologous positions. Let’s look at the diagram in Figure 1.81. We can tell that this is a diagram of two different DNA sequences being compared. It’s not a diagram showing the base pairing between two different strands of one DNA molecule. We can tell that first because it’s labelled: genome 1, genome 2, not 5’ end, 3’ end. We can also tell very clearly because at every position except for one, the bases are identical. And at every position where the bases are identical, there’s a vertical line. That tells us that these vertical lines mean this is a position where the two sequences, the two genomes, have identical bases. But there’s one position where the genomes have different bases. That position represents a single-nucleotide variant; just one position where there is variation, where two different versions are present.

Figure 1.81: Two genomes are compared. Their bases are identical at every position except for one. The one position where the genomes have different bases represents a single-nucleotide variant.
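The vertical-line convention for comparing two aligned sequences can be sketched in a few lines of Python. The two short sequences below are hypothetical and are not the ones drawn in Figure 1.81; the function name `match_line` is also just an illustrative choice.

```python
def match_line(seq1, seq2):
    """Return '|' where two aligned sequences agree and ' ' where they differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must already be aligned to the same length")
    return "".join("|" if a == b else " " for a, b in zip(seq1, seq2))


# Hypothetical homologous sequences that differ at a single position.
genome1 = "ACGTTAGC"
genome2 = "ACGTCAGC"

bars = match_line(genome1, genome2)

# Positions are counted from 1, as in the text.
snv_positions = [i + 1 for i, mark in enumerate(bars) if mark == " "]

print(genome1)
print(bars)
print(genome2)
print("single-nucleotide variant at position(s):", snv_positions)
```

A gap in the row of vertical lines marks the single-nucleotide variant, here at position 5 of the made-up sequences.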

There’s another kind of DNA sequence difference that we need to be aware of, and that’s got the rather corny name of indel. Indels are genetic differences not created by changing one base to a different base, but by inserting one or more bases into, or deleting them from, a DNA segment, which changes the length of the DNA segment.

Figure 1.82: An indel causes the difference between genome 1 and genome 2.

An example is given in Figure 1.82. Again we’re comparing two DNA sequences. But one’s longer than the other. The term indel is used because when we’re comparing two DNA sequences and we find that they’re different lengths, we don’t know whether the difference, marked in red in Figure 1.82, arose because something was inserted into the sequence of genome 1, or because something was deleted from an ancestor of the genome 2 sequence. Indel is just a compromise word, because we don’t know which it is, insertion or deletion.

Here is an overview of the important new terms.

Alleles: Non-identical versions or variants of a gene or, more generally, of a DNA sequence.

SNV: A single-nucleotide variant; the presence in a population of different nucleotides at homologous positions in two DNA sequences.

Indel: A genetic difference created by insertion or deletion of a bp or a longer DNA segment.

1.O.3 Comparing DNA sequences

When we compare two DNA sequences, as I emphasized in Section 1.O.2, they need to be homologous. That means, from our definition of homology, that they’re similar because of shared descent from a common ancestor. So when we’re comparing two positions, we want to be sure that those two positions are the same or different versions of the same base in a common ancestor, that they were both inherited from a single position. We don’t want to be comparing what were different positions in the common ancestor.

Figure 1.83: Two homologous sequences that need to be compared.

Let’s compare the two sequences in Figure 1.83. And I’m telling you that these sequences are homologous.

Again, we draw a vertical line for positions where the sequences are identical, which results in Figure 1.84.

Figure 1.84: The same sequences as in Figure 1.83, where we have drawn a blue vertical line at positions where the sequences are identical.

They don’t look very homologous. We expect one base in four to match just by chance. This doesn’t look very promising at all. Maybe they’re not lined up right.

Let’s shift them over one place and try again. The result is given in Figure 1.85. No, that’s still pretty bad.

Figure 1.85: Comparison of the two sequences where sequence 2 has been shifted one position to the right. The red lines show at which positions the sequences are identical.

OK, let’s shift them over a bit more, which is shown in Figure 1.86. Aha. Now we have the right alignment and we see that almost every position matches. Now we’re confident that the homologous positions are aligned. That’s critical when we’re comparing sequences, because some sequence variation is indels. Insertions and deletions are fairly common in chromosomes. You can’t just start at the end of two homologous chromosomes, line up all the bases and compare them. You have to make sure at each position that the sequences are correctly aligned, so that the alignment reflects homology.
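The trial-and-error shifting described above can be sketched as a small Python routine that counts matching bases for each shift and keeps the shift with the most matches. The sequences are hypothetical stand-ins for the ones in Figures 1.83 to 1.86, and the function names `count_matches` and `best_shift` are our own. Real alignment programs also have to handle indels inside the sequences, which this sketch does not attempt.

```python
def count_matches(seq1, seq2, shift):
    """Count identical bases when seq2 is moved `shift` positions to the right."""
    matches = 0
    for i, base in enumerate(seq2):
        j = i + shift  # position in seq1 that lines up with seq2[i]
        if 0 <= j < len(seq1) and seq1[j] == base:
            matches += 1
    return matches


def best_shift(seq1, seq2, max_shift):
    """Try shifts 0..max_shift and return the one with the most matches."""
    return max(range(max_shift + 1), key=lambda s: count_matches(seq1, seq2, s))


# Hypothetical homologous sequences: seq2 lacks the first two bases of seq1
# and carries one single-nucleotide difference.
seq1 = "GGACGTTAGCATCG"
seq2 = "ACGTTAGCTTCG"

for shift in range(4):
    print(shift, count_matches(seq1, seq2, shift))

print("best shift:", best_shift(seq1, seq2, 3))
```

With these made-up sequences, shifts of zero and one give almost no matches, while a shift of two positions makes nearly every base match, mirroring the "Aha" moment in the text.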


Figure 1.86: Sequence 2 has been shifted one more position to the right. The black lines show identical sequence positions. We have found the right alignment.

1.O.4 Allele frequencies in populations

Now it’s time to think about the frequency of alleles and genotypes in a population because this will be critical for a lot of the analysis that we do. In particular, it’ll be important for the last lecture when we talk about the evolution of human genetic differences.

The allele frequency in a population is really, really simple:

\[ \text{frequency of an allele} = \frac{\text{number of that allele}}{\text{total number of alleles}}. \]

It’s just the number of occurrences of that allele divided by the total number of alleles in the population. That’s the frequency. Sounds easy.

But what if we don’t actually know the number of alleles, because we haven’t counted the alleles? What if you’ve only counted the genotypes? You surveyed the population, took a sample of people, and you determined their genotypes at a particular position that you’re interested in. And these are the numbers that you got.

57 people: CC
22 people: CT
13 people: TT

Total: 92 people

Can you calculate the frequency of one or the other allele? Sure, it’s easy. We know the denominator. We know that there are 92 people. Each person has two alleles, so that’s 184 alleles in the whole sample. We’re assuming that this is a large enough sample that it’s representative of the population. What is the frequency of the T allele?

Well, 57 people don’t have a T allele at all. We don’t need to worry about them. 22 people have one T and one C allele. So that’s 22 Ts. And 13 people have two Ts. Those people are going to contribute 26 Ts. In total there are 22 + 26 = 48 T alleles. The frequency of T is then $48/184 \approx 0.26$.

What about the other way around? What if we want to think about genotype frequencies in the population? Well, the basic calculation is still really simple:

\[ \text{frequency of a genotype} = \frac{\text{number of that genotype}}{\text{total number of genotypes}}. \]

It’s just the number of occurrences of a particular genotype out of the total number of genotypes, which is the total number of individuals in the population.
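The allele count from a genotype survey can be sketched in Python as follows. The genotype counts are the ones from the sample of 92 people above; the function name `allele_frequencies` is our own choice for illustration.

```python
from collections import Counter


def allele_frequencies(genotype_counts):
    """Turn {genotype: number of people} into {allele: frequency}."""
    allele_counts = Counter()
    for genotype, people in genotype_counts.items():
        # Each person carries both alleles written in the genotype,
        # so a TT person adds two Ts and a CT person adds one C and one T.
        for allele in genotype:
            allele_counts[allele] += people
    total_alleles = sum(allele_counts.values())
    return {allele: n / total_alleles for allele, n in allele_counts.items()}


sample = {"CC": 57, "CT": 22, "TT": 13}  # 92 people, 184 alleles
freqs = allele_frequencies(sample)

print(round(freqs["T"], 2))  # 48 / 184
print(round(freqs["C"], 2))  # 136 / 184
```

The two frequencies add up to 1, which is a quick check that no alleles were missed in the count.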


But what if we only know the allele frequencies? Often when you’re given a description of a population, you’re given the allele frequencies but not the genotype frequencies. For example, if we want to know the frequency of the AT genotype and all we have is

frequency of A = 0.8
frequency of T = 0.2
(C and G not present)

In this case, let’s assume that we have the allele frequencies for two alleles in the population and that the other two alleles aren’t present. As you’ll see, this is a typical case for real alleles. And we want to know the frequency of the AT genotype. Now I’m going to show you a trick that might look kind of like the Punnett squares that you might have encountered if you learned basic genetics at some point, except I’m drawing this square off-center, representing the frequencies 0.2 and 0.8.

           0.8 A                          0.2 T

0.8 A      AA: $0.8^2 = 0.64$             AT: $0.2 \cdot 0.8 = 0.16$

0.2 T      TA: $0.2 \cdot 0.8 = 0.16$     TT: $0.2^2 = 0.04$

The square is just a way of illustrating what happens when alleles come together at random, which is the normal case for almost every aspect of our genome. When people married each other and had children, they did not check their genotype at this particular position in their genome before they did it. Mating is random with respect to genotype. Using this randomness, we can predict just from basic probability the frequencies of the genotypes that are going to be present in the population.

We know that 0.8 of the alleles in the population are A, so we draw A with a width proportional to 0.8. And 0.2 of the alleles are T, so T gets a smaller space. We can then simply look at the picture and say oh, yeah, of course, the chance of A meeting A to give the AA genotype is going to be $0.8^2$, which is 0.64. And the chance of T meeting T to give the TT genotype is going to be $0.2^2 = 0.04$.

However, what we’re interested in is the chance of T meeting A or A meeting T. These frequencies are, again, just the product of the frequencies of the alleles. So it’s $0.8 \times 0.2 = 0.16$ for AT and for TA. However, we’ve got two occurrences of this genotype; they’re just written in different orders, but it’s the same genotype. So the frequency of the AT (or TA) genotype is 0.16 + 0.16 = 0.32. We can easily check that we’ve done the arithmetic right by checking whether our numbers add up: 0.04 + 0.64 + 0.32 = 1. The numbers add up to the whole population, so we can be confident that we’ve done the calculation right.
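The square can also be written out as a short calculation. This is a minimal Python sketch that assumes, as the text does, that alleles come together at random; the function name `genotype_frequencies` is ours and is only meant as an illustration.

```python
def genotype_frequencies(allele_freqs):
    """Predict genotype frequencies from allele frequencies under random mating."""
    alleles = sorted(allele_freqs)
    result = {}
    for i, first in enumerate(alleles):
        for second in alleles[i:]:
            product = allele_freqs[first] * allele_freqs[second]
            # A heterozygote appears twice in the square (AT and TA),
            # so its two cells are added together.
            result[first + second] = product if first == second else 2 * product
    return result


predicted = genotype_frequencies({"A": 0.8, "T": 0.2})

for genotype, freq in predicted.items():
    print(genotype, round(freq, 2))

print("total:", round(sum(predicted.values()), 2))
```

The printed total repeats the check made in the text: the three genotype frequencies should account for the whole population.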

Now that we know how to think about allele frequencies in populations, I have to introduce some terminology.

Rare variant: A genetic difference present in < 1% of the alleles in the population.

Polymorphism: A genetic difference present in ≥ 1% of the alleles in the population.

SNP (“snip”): A single-nucleotide polymorphism; an SNV present in ≥ 1% of the alleles in the population.

If a genetic difference is present in less than 1% of the alleles in the population, it’s usually described as being a rare variant. But if it’s present in at least 1% of the alleles in the population, there’s another term that’s very widely used. It’s called a polymorphism.

Until recently, geneticists have spent a lot more time studying polymorphisms than studying rare variants, just because polymorphisms are a lot easier to find and to study. We’re only now developing the tools to study rare variants. And we’ll talk a little bit about this in Module 5.

If we combine this with the single-nucleotide variant (remember, single-nucleotide variants are differences that affect a single position in a DNA sequence), then we end up with a single-nucleotide polymorphism, abbreviated SNP. And because they’re talked about a lot, we have a way to say this. They’re called SNIPs. SNIPs are a very common kind of variation in populations, they are very easy to study, and they’re the foundation of all of the personal genomics that we’ll introduce in Module 6. So this is a very important concept to get straight.
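The 1% cut-off in these definitions can be written as a one-line rule. The Python sketch below is only an illustration of the terminology; the function name `classify_variant` and the default `threshold` parameter are our own, and the frequency passed in is assumed to be the frequency of the variant among all alleles in the population.

```python
def classify_variant(allele_frequency, threshold=0.01):
    """Label a genetic difference by the 1% convention used in these notes."""
    if not 0 <= allele_frequency <= 1:
        raise ValueError("a frequency must lie between 0 and 1")
    return "polymorphism" if allele_frequency >= threshold else "rare variant"


print(classify_variant(0.26))   # the T allele from the sample of 92 people
print(classify_variant(0.005))  # a variant found in 0.5% of alleles
```

A single-nucleotide variant that this rule labels a polymorphism is what the text calls a SNIP.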


Figure 1.87: A single-nucleotide variant, which can be a rare variant or a polymorphism.

Let’s look at the example in Figure 1.87. If each of the variants in Figure 1.87 is present in at least 1% of the alleles in the population, this would be a SNIP. If the second variant (with the g base) was very, very common and the first variant (with the a base) was very, very rare, we’d say well, it’s a rare single-nucleotide variant, we’re not going to study it, it’s too difficult.

Most polymorphisms have only two common alleles, like the polymorphism that I introduced when we were doing our arithmetic earlier in this section. An example is shown in Figure 1.88.

Figure 1.88: A small segment of the BRCA2 gene showing the known SNIPs. Each vertical pair describes a SNIP at that particular position. There are lots of SNIPs both in introns and in exons, and at each SNIP only two alleles are common, which is usually the case.

Figure 1.88 shows a small segment of the BRCA2 gene that we described in Section 1.J.1. This gene has alleles that cause a high risk of breast cancer. This diagram shows the known SNIPs, the known single-nucleotide polymorphisms, in this segment of the genome, which is about 20 kilobases long. I’ve circled the first variations so you can recognize the diagram. It’s a little tricky to interpret. Each vertical pair describes a SNIP at that particular position.

We start at the left and first there are three different SNIPs present, relatively close together, in the intron. They are followed by two more SNIPs in the intron and then there are three SNIPs in the first small exon. Next, there are five SNIPs in an exon followed by one SNIP in the middle of an intron. You can see that there are lots of SNIPs both in introns and in exons and that there are always only two alleles that are common.


1.O.5 Haplotypes

I want to introduce one more concept, the concept of a haplotype. We need this for the next lecture.

Haplotype: The genotype of a short segment of a chromosome

Instead of being the genotype of a particular position, a haplotype is the genotype of a segment of the chromosome. The term is used with slightly different meanings in other contexts, but for us, for now, this will capture it nicely.

Figure 1.89: The sequences of a little bit of the genomes of two different individuals.

In Figure 1.89 I show the sequences of a little bit of the genomes of two different individuals. And we see that there are four different places where they have different bases in their DNA. We would describe the first sequence, this collection of variation, this segment of variation, as a haplotype. We could say this is the haplotype that has a t at position 2, an a at position 16, a t at position 20 and a c at position 28. The second sequence is a haplotype that has a g at position 2, a g at position 16, an a at position 20 and a g at position 28.
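Describing a haplotype by its variable positions can be sketched in Python. The two 30-base sequences below are invented for illustration; only the four variable positions and their bases are taken from the description of Figure 1.89, and the function names `variant_positions` and `haplotype` are our own.

```python
def variant_positions(seq1, seq2):
    """Return the 1-based positions where two aligned sequences differ."""
    return [i + 1 for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]


def haplotype(seq, positions):
    """Summarise a sequence by its bases at the variable positions."""
    return {pos: seq[pos - 1] for pos in positions}


# Hypothetical sequences, written in blocks of ten bases for readability.
individual1 = "atcgtacgta" + "cgtacagtat" + "cgtacgtcac"
individual2 = "agcgtacgta" + "cgtacggtaa" + "cgtacgtgac"

positions = variant_positions(individual1, individual2)

print(positions)
print(haplotype(individual1, positions))
print(haplotype(individual2, positions))
```

Each dictionary is a compact way of writing "the haplotype that has a t at position 2, an a at position 16" and so on, without repeating the bases that the two individuals share.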

Summary

This has been a long lecture. The longest lecture of Module 1. We’ve introduced a lot of new material building on the previous lectures. We talked about kinds of sequence variation, how we can compare DNA sequences, and why it matters how we line up the bases, because some sequence variation changes the lengths of DNA sequences. We talked quite a lot about alleles and genotypes in populations. We did some calculations and we introduced a number of new terms: homozygous and heterozygous, polymorphism, SNIP, indel, haplotype. These terms are going to come up again and again in the course.


Lecture 1.P Genetic and evolutionary relationships of human populations

Outline

This is the last lecture of Module 1, where we’re coming full circle back to thinking about human genetic differences. And we’ll be using these differences to get a deeper and an evolutionary picture of the genetic history of our species. We will start in Section 1.P.1 with human similarity. In Section 1.P.2 we will look at similarities of other species. Section 1.P.3 will discuss the ‘Out of Africa’ theory and in Section 1.P.4 we will answer the question “Did humans really mate with Neanderthals?”

1.P.1 Human similarity

Figure 1.90: Within-species diversity in humans

We’ll start basically where we left off in the first lecture, thinking about how different people are from one another. And we said at that time that two people, on average, are almost identical, that their DNA differs at only about 0.1% of all the sequence positions in their genome. This is illustrated in Figure 1.90.

1.P.2 Similarities of other species

Figure 1.91: Within-species diversity in most plants and animals


Humans are actually a little more similar to one another than the members of most plant and animal species, where on average there’s about 99%, not 99.9%, sequence identity, as illustrated in Figure 1.91. This is because we’re actually quite a young species compared with the average age of the species that are alive today. We arose relatively recently, and we haven’t accumulated as much difference.

We can see the whole range of within-species differences in the rather overwhelming chart in Figure 1.92. Basically, this is the extent of sequence difference, from more than 5% in some worms down to very, very little genetic difference in some species that have recently been very highly inbred. We’ll talk about this inbreeding a lot later in the course. We’re definitely down at the low-variation end of the spectrum of species, which is thought to reflect our recent origins as a species.

1.P.3 Human origins: Out of Africa

Insights into our origins as a species can come from thinking in a larger way about the genetic differences between different people from different parts of the world. I want to describe an important study by Dr. Sarah Tishkoff and her colleagues, published in the American Journal of Human Genetics in 2000 (Tishkoff et al. 2000 Am. J. Hum. Genet. 67:901-925).

What they did was they analysed not just single base differences but haplotypes. Remember, haplotypes are segments of DNA with genetic differences, so all the genetic differences in a segment. In this study, they looked at segments with four different genes. They sequenced these four genes in 1,400 people from different places around the world, and they counted the number of different haplotypes that they found in those 1,400 people. They found 98 different haplotypes in Europeans, 73 in Asians, and 199 in Africans. So already, we can see that there’s more genetic variation in Africans than there is in people from other parts of the world. This is actually consistent with other studies of human genetic differences.

More insight came when they thought about how many of these sequence differences were actually shared between different groups. What they discovered was that most of the genetic variation that they found in Europeans or in Asians was also present in Africans, as represented in Figure 1.93 by the intersection of the three areas. This suggested that Africans were the older and more diverse peoples and that Europeans and Asians represented a subset of the genetic diversity in Africans.

This result was consistent with and strongly supported a view of human evolution called the “Out of Africa” model. That model proposed (and we are quite confident that this is indeed true) that humans originated in Africa. Around 150,000 years ago, humans spread through Africa and maybe 100,000 years ago, a particular group of these people moved out of Africa into south-western Asia, what we call the Near East. Then they gradually spread, over thousands and thousands of years, into Europe, into southern Asia, into northern Asia, into South-East Asia, and to Australia and the Pacific Islands, and up through eastern Asia, across the Bering Land Bridge, and down into North and South America. This is shown in Figure 1.94.

Figure 1.92: The diversity of individuals within a species for several different species. The human species, highlighted in yellow, is at the low-variation end of the graph. This is because we are a relatively new species.

Figure 1.93: Most of the genetic variation that was found in Europeans and in Asians is also present in Africans.

Figure 1.94: The ‘Out of Africa’ model for the early human migrations is strongly supported by genetic evidence.

1.P.4 Did humans really mate with Neanderthals?

The “Out of Africa” model for the early human migrations is strongly supported by many lines of evidence. But what I want to tell you about now is new evidence that adds a wrinkle to this Out of Africa model. It addresses the question of, well, did our ancestors really mate with Neanderthals? You remember that Neanderthals are an extinct type of human with big jaws and heavy brows. They're the canonical caveman morphology.

Thanks to a wonderful body of work by Professor Svante Pääbo of the Max Planck Institute for Evolutionary Anthropology, we know that the answer to this question is yes. He describes his work very nicely in a TED Talk (<https://www.ted.com/talks/svante_paeaebo_dna_clues_to_our_inner_neanderthal>).

Figure 1.95: The evolutionary history of humans and their encounters with Neanderthals: Early humans originated in Africa, spread and diversified. A subgroup migrated out of Africa and became the Neanderthals. Later, another group of ‘anatomically modern’ humans migrated out of Africa into South-West Asia, where they encountered the Neanderthals, indicated by the two yellow circles. The humans and Neanderthals mated and their offspring continued to spread. Some of them encountered Neanderthals again, this time in South-East Asia (the second pair of yellow circles).

Figure 1.95 is my drawing summarizing what we know about what happened in human evolutionary history.

Early humans originated in Africa and within Africa, they spread and diversified. At some point, several hundred thousand years ago, a subgroup of these people migrated out of Africa into Europe and Asia. And these are the people that we call the Neanderthals.

At the same time that the Neanderthals were colonizing this new land, evolution was continuing in Africa. The early humans were gradually replaced by what we call anatomically modern humans, people whose facial features and skull structures are much more like ours.

The same thing happened as before. The anatomically modern people gradually spread through Africa. But one branch of them migrated out of Africa, again, into South-West Asia. We now know that, at this time, they encountered the Neanderthals. The early humans in Africa were gone, but the Neanderthal people in southern Europe and Asia were still there. So, the anatomically modern people encountered the Neanderthals. They mated with them. They had children. Their children became part of the wave of anatomically modern humans that were spreading into Europe and through Asia into northern Asia, into South-East Asia.

In South-East Asia, the anatomically modern humans that were moving across Asia again encountered Neanderthals. And again, they mated with them, had children and their children became part of the population that was extending into South-East Asia.

The anatomically modern humans continued to spread, including spreading over into North and South America to be the native peoples of those continents. But the Neanderthals, like the early African humans, died out.

So we know that the ancestors of non-African humans mated with Neanderthal humans before the Neanderthals went extinct and that some Neanderthal haplotypes live on in the non-African genomes. African peoples don’t have Neanderthal DNA.

This is all pretty detailed, considering none of us were there to watch this happen. How do we know this? Well, we know this because Dr. Pääbo’s team has developed sophisticated ways to sequence DNA from ancient bones, even though that DNA is very badly degraded. It’s broken, it’s damaged in many ways. But they can sequence it. They have reasonable-quality whole-genome DNA sequences from six Neanderthals, including one Denisovan. The Denisovans are an extinct human species that is closely related to Neanderthals. The second interbreeding between humans and Neanderthals, illustrated by the yellow circles in South-East Asia in Figure 1.95, was actually with this species.

The team can identify Neanderthal haplotypes by comparing the Neanderthal genomes with human genomes. What they find is that the genomes of non-African people contain 2% to 3% Neanderthal alleles and haplotypes. And the genomes of the natives of Papua New Guinea and Australia (the South Pacific natives whose ancestors participated in the second interbreeding) also contain about 6% sequence patterns from the Denisovans.

Summary

We’ve talked about human similarity and human differences. We compared ourselves to other species. Then we did a more detailed analysis of human genetic differences, looking at haplotypes, not just at single-position differences. Analysis of these haplotypes in modern people confirms the Out of Africa model for the evolutionary history of humans. And Dr. Pääbo’s techniques for sequencing ancient DNA also confirm that some of our ancestors mated with Neanderthals in their migration out of Africa. Of course, those of us who are Africans are not involved in that. Now, again, I want to strongly recommend Dr. Pääbo’s TED Talk (<https://www.ted.com/talks/svante_paeaebo_dna_clues_to_our_inner_neanderthal>). It’s 20 minutes long. It summarizes in its introduction just about everything that we’ve talked about in this Module. He’s a wonderful, dynamic speaker, so you’ll find it well worth your while as a refresher for the whole of Module 1.


Index

allele, 38, 41, 63, 64, 67, 70
  frequency, 67, 68
amino acid, 20, 21, 23–25, 27, 30, 32, 51
aniridia, 51

base, 4, 8, 9
  complementarity, 8
base pairs, 37
BRCA2 gene, 42, 70

cell, 5, 31, 48, 54, 58, 59
cell cycle, 54, 59
cell division, 5, 56, 60
cell theory, 18
centromere, 37, 40–42
chromosome, 4, 36, 37, 41, 42, 54, 58, 60, 71
  homologous, 53, 58
  representation, 38
  X, 37
  Y, 37
codon, 24, 25, 30
  start, 22–24, 28–32
  stop, 23, 24, 28–30
common ancestor, 18, 51–54, 65
complementarity, 8, 17, 19, 25, 49, 63
convergent evolution, 52
crossing over, 59, 61

daughter cells, 56
deoxyribose, 19
diploid, 58–60
diversity
  human, 72
DNA, 4, 9, 29, 33, 35, 36, 54
  complementary, 49
  daughter, 17
  double helix, 4, 10
  evolution, 18
  homologous, 53
  information carrier, 10
  non-functional, 36
  physical molecule, 8
  properties, 7
  replication, 9, 15, 33, 35, 59
  representation, 12
  sequencing, 46, 48
  strand, 8–10
DNA polymerase, 15

echidna, 52
egg, 57, 60
enzyme, 48
error correction, 10
evolution, 17, 18, 49, 53
exon, 29, 30, 44, 45, 70
eyeless, 51

functional RNA, 21, 23

gamete, 57, 59, 60
gene, 4, 19, 21, 27, 29–32, 36, 37, 41, 42, 48
  arrangement on chromosome, 44
genetic code, 24, 25
genetic information, 5
genome, 5, 21
  differences, 6, 7
  sequencing, 47
genotype, 3, 63, 67, 71
  frequency, 67

haploid, 58, 59
haplotype, 71, 73, 77


hedgehog, 52
heterozygous, 64
homologous chromosomes, 53
homology, 49, 52, 64–66
  chromosome, 53, 58
  DNA, 53
  protein, 53
homozygous, 63
hydrogen bond, 8, 9

indel, 65, 66
index, 44
intron, 29–31, 45, 70

kilobase, 44

life cycle, 5, 54
  animal, 55
  multicellular organism, 56
  plant, 55
  unicellular organism, 55
locus, 38, 41

Mb, see Megabase
Megabase, 42
meiosis, 57, 60
Mendel, Gregor, 31
messenger RNA, 23–26, 29, 30, 44, 45
metagenomics, 48
microbiome, 48
mitosis, 55, 56
Moore’s law, 47
mRNA, see messenger RNA

natural selection, 30, 50
Neanderthals, 75
nucleic acid, 19, 35

open reading frame, 31
origins, 36
out of Africa, 73

phenotype, 3, 31
  differences, 5, 7
ploidy, 58
pollen, 57
polymer, 8, 10, 35
polymorphism, 69, 70
population, 67
progeny cells, 56
promoter, 21–23, 29–31
protein, 10, 11, 19–21, 23, 24, 27, 29, 30, 34–36, 45
  homologous, 53
  synthesis, 10, 22, 24
Punnett square, 68

rare variant, 69
reading frame, 27, 31, 32
  open, 31
reassortment, 59, 61
recombination, 58, 59
regulatory protein, 10–12, 31
regulatory sequence, 22, 30
regulatory signals, 36
reproduction
  asexual, 56
  sexual, 56
ribose, 19
ribosome, 21, 23, 26, 28
ribosome binding site, 28–31
RNA, 9–11, 19, 21, 29, 30, 33, 35
  abundance, 48
  functional, 21, 23
  messenger, 23–26, 29, 30, 44, 45
  synthesis, 22, 44
  transfer, 25
RNA polymerase, 11, 22, 30, 31, 44

shared ancestry, 50
shuffling, 61
signal, 21, 22
similarity
  other species, 72
single-nucleotide polymorphism (SNP), 69
single-nucleotide variant (SNV), 64
snip, see single-nucleotide polymorphism
SNP, 69
SNV, 64, 69
sperm, 57, 60
splicing, 29–31
start codon, 22–24, 28–32
stop codon, 23, 24, 28–30

telomere, 37, 42
terminator, 21, 22, 29–31
transcription, 11, 22, 27, 31, 33, 35
transcription factor, 30
transfer RNA, 25
translation, 11, 21–25, 27, 31–33, 35, 45
tree of life, 17, 18
tRNA, see transfer RNA
