Date post: 27-Jul-2020
Page 1: Isochores as Extreme Cases of Genes Cooperatively ...numerical.recipes/whp/biopreprint/draft2obs.pdf · 1Corresponding author: D-1 Group, MS F-600, Los Alamos National Laboratory,

Isochores as Extreme Cases of Genes Cooperatively Reshaping the Large-scale Genomic Environment

William H. Press∗,1 and Harlan Robins†

∗Los Alamos National Laboratory, Los Alamos, NM 87545
†Institute for Advanced Study, Princeton, NJ 08540

1 Corresponding author: D-1 Group, MS F-600, Los Alamos National Laboratory, Los Alamos, NM 87545. E-mail: [email protected]

Manuscript received November , 2005.

ABSTRACT

The genomes of mammals and birds can be partitioned into megabase-long regions, termed isochores, with consistently high, or low, average C + G content. CG isochores contain a mixture of CG-rich and AT-rich genes, while AT isochores contain predominantly AT-rich genes. In CG isochores, the two populations of genes are statistically distinguishable by Gene Ontology keywords. AT isochore genes are functionally intermediate between AT- and CG-genes in CG isochores. Genes tend to be located at local extrema of composition. The effect is particularly strong for CG isochores, but is also seen for AT isochores. Genes “led” in isochore formation rather than “followed”. Isochore formation drove large numbers of non-synonymous codon changes that make little biochemical sense, but that all favor CG richness in coding regions; isochore formation must have been driven by strong natural selection. Disparate features of isochores might be explained by a model in which about half of all genes acquired a fitness preference for mutations to C or G, extending with an (e.g., power-law) tail to large distances from the gene. Random fluctuations in gene preference on large genomic scales can be thus amplified into observed isochore structure.


Isochores, so named by Bernardi (Bernardi et al. 1985; Bernardi 2000), are large regions in the human genome, as long as tens of megabases, that are anomalously rich in C and G nucleotides. Isochores analogous to those in human are found in the genomes of all mammals and birds (Bernardi 2000), plus a small number of reptiles such as the Nile crocodile (Hughes et al. 1999). Invertebrates, and almost all cold-blooded vertebrates, do not manifest isochore structure in their genomes. The putative common ancestor in which isochores originated is thus an amniote in the Carboniferous period (≈ 300 Ma b.p.), although it was not until after the Permian-Triassic extinction (≈ 250 Ma b.p.) that the carriers of isochores, namely archosaurs, birds, and mammals, proliferated.

Isochores are by no means subtle features in the genome (IHGSC 2001). By way of example, Figure 1 shows the A + T (opposite of C + G) content of three human, and three zebrafish, chromosomes, plotted on a common scale. The nucleotide counts are shown as bars in 300 kb bins, with the base of the bars at A + T = 0.58, an arbitrary value that approximately divides CG isochores from AT isochores (as we will refer to regions that are not CG-rich).

[Figure 1 image. Panels: human chr 10, human chr 11, human chr 12, and fish chr 7, fish chr 14, fish chr 21. Axes: A+T fraction versus position (Mb).]

Figure 1: Local A+T fraction of typical human and zebrafish chromosomes. Counts are shown in nonoverlapping 300 kb windows.
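The window counts behind a plot such as Figure 1 can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `at_fraction_windows`, the choice to ignore uncalled bases such as N, and the choice to drop an incomplete trailing window are all assumptions made here.

```python
def at_fraction_windows(seq, window=300_000):
    """Return the A+T fraction in consecutive nonoverlapping windows.

    Assumptions (not from the paper): bases other than A, C, G, T are
    ignored, and a trailing partial window is dropped.
    """
    fractions = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window].upper()
        at = chunk.count("A") + chunk.count("T")
        cg = chunk.count("C") + chunk.count("G")
        # A window with no called bases has no defined fraction.
        fractions.append(at / (at + cg) if at + cg else float("nan"))
    return fractions


# Toy usage with a 4-base window instead of 300 kb.
toy = at_fraction_windows("AATTCCGGAACG", window=4)
```

Windows whose fraction exceeds a chosen cut (0.58 in Figure 1) would then be drawn as AT-rich, and the rest as CG-rich.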

It is not a settled issue whether isochore formation continues today, that is, whether CG isochores are continuing to form from AT isochores. However, a body of recent evidence suggests that, on the contrary, isochores are gradually disappearing from mammalian genomes (Duret et al. 2002; Belle et al. 2004). If so, then we may view the formation of isochores as a unique event in our past, a strong evolutionary pressure which first appeared and then disappeared. Apart from the obvious question as to what caused this to happen, we may also hope to learn from the isochore-forming event something about the interaction of genes, as primary carriers of functional information, within the much larger genome that they inhabit.

It has proved surprisingly difficult to find functional relationships between isochores and the genes inside them (Vinogradov 2003; IHGSC 2001). By default, the more conservative view has been that isochores are the result of the accumulation of neutral changes caused by (evidently spatially nonuniform) mutation or repair biases. One currently favored model is biased gene conversion (BGC) during homologous recombination (Eyre-Walker and Hurst 2001). In the conservative view, genes are passive riders on the isochore background. That is, their noncritical elements, such as synonymous bases in 3rd codon positions and nonfunctional bases in their 3′ and 5′ untranslated regions (UTRs), should evolve towards CG richness along with the rest of an isochore. Indeed, it is well established (Bernardi et al. 1985; Clay et al. 1996; Hamada et al. 2003), and easy to show, that the CG content of 3rd codon positions and 3′ and 5′ UTRs are all strongly correlated with the CG content of the surrounding genomic region. However, changes in functional gene elements should be relatively rare, since (in this picture) gene fitness should dominate over fitness-neutral mutational bias.

Less conservative, but also longstanding, is the hypothesis that the evolution of isochores was favored by natural selection, for example selection in warm-blooded vertebrates for DNA that is stable at higher temperature (Bernardi 2000; Smith and Eyre-Walker 2001). However, several such hypotheses notwithstanding, the nature of the selection pressure remains obscure (Belle et al. 2002; Eyre-Walker and Hurst 2001).

In the conservative view, we should also not expect to see statistically significant functional differences between genes in an AT versus CG isochore, since during isochore formation, the genes of the (pre-existing) population are simply hitchhikers. However, without reference to isochores, we have previously shown (Robins and Press 2005) that AT-rich and CG-rich genes are readily distinguishable, statistically, by gene functionality. In particular, AT-rich genes are preferentially associated with evolutionarily “early” biological processes such as transcription and mRNA processing, while CG-rich genes are associated with evolutionarily “late” processes such as receptors, signal transduction and signaling cascades.


Below, we will show that, as one would expect, AT- and CG-rich genes are associated with the corresponding AT and CG isochores, though not quite simply: Genes in AT isochores are predominantly AT-rich, while genes in CG isochores, along with regions flanking each gene, may be either AT-rich or CG-rich, resulting in a complex landscape of AT-rich intrusions into what are otherwise CG isochores. Furthermore, the two groups of genes in CG isochores, AT-rich and CG-rich, are statistically distinguishable by function.

We also find evidence that genes that became CG-rich (in CG isochores) were far from passive passengers. They were not dragged along; rather, they seem to have “led the charge,” while carrying large intergenic neighborhoods with them. Additional evidence of strong selection is that the amino acid usage changed substantially for these CG-rich genes, as we show by aligning human and zebrafish genes. That genes are at special locations of composition is already suggested visually, at least for CG-rich genes, if one simply looks at the position and composition of genes relative to window counts (Figure 2), where an unexpected number of CG-rich genes seem to occur in bins that are extrema. We give a more quantitative test below.

The picture that emerges is one in which genes compete to control the genomic AT or CG richness on length scales much larger than the size of a single gene. The AT-rich and CG-rich populations each try to achieve the environment that maximizes their fitness. If fitness is affected by the environment on a scale larger than a single gene (we suggest some possible mechanisms in Discussion below), then neighborhoods of genes can, in effect, cooperate in favoring, by selection, certain intergenic mutations. In this picture, isochore formation may have resulted from a large, aberrant, and possibly temporary advantage gained by the CG-favoring gene population. In support of this view, we show evidence that, in fish, the same competition between two gene populations is occurring (or did once occur), but that the strength of its interactions at long distances is much weaker than in human, so that some critical threshold is not reached.

METHODS

Defining Gene Populations and Large-Scale Isochores: We use A + T and C − G counts in a gene’s 3′ UTR to determine whether it is an AT-rich or CG-rich gene, applying the algorithm given in Robins and Press (2005), equation [1], to get a probability. This method was shown to yield the cleanest separation of the two gene populations.

Since isochores are not homogeneous (IHGSC 2001, and cf. Figure 1), a precise definition is perforce somewhat arbitrary. However, if one plots the above AT- versus CG-rich probability for each gene along the genome, as in Figure 3, a clear pattern emerges: Some regions extending over many megabases contain predominantly AT-rich genes, while other regions contain a more equal mixture of AT- and CG-rich genes. There are few, if any, large regions containing predominantly CG-rich genes, consistent with previous evidence (Pavlicek et al. 2002) that CG isochores, however defined, have larger compositional variances than do AT isochores.

[Figure 2 image. Panels: chr 1 (within CG isochore), chr 1 (within AT isochore). Axes: A+T fraction versus position (Mb).]

Figure 2: Two regions of human chromosome 1, plotting A + T counts in 20 kb windows, and showing the location of all RefSeq genes. Genes are plotted at the A + T value of the window in which they occur, but with their color continuously varying from red (AT-rich gene) to blue (CG-rich gene). Genes in a CG isochore (upper panel), notably CG-rich genes, tend to be more extreme than their surround; there is less such tendency in an AT isochore (lower panel).

[Figure 3 image. Panels: human chr 10, human chr 11, human chr 12. Axes: probability that gene is in AT-rich population (0 to 1) versus position (Mb).]

Figure 3: RefSeq genes plotted according to their probability of being in the AT-rich population, for three typical chromosomes. Large regions of AT-rich genes, and of mixed AT- and CG-rich genes are evident. Large regions of CG-rich genes are conspicuously absent.

We can therefore define isochore boundaries by a Markov model that alternates between two states, AT-dominant and mixed. In the AT-dominant state, the respective probabilities of an AT-rich and CG-rich gene are taken as (0.9, 0.1), while in the mixed state they are taken as (0.5, 0.5). The state transition probability between any two consecutive genes is taken as 0.001 (that is, 0.999 chance of remaining in the same isochore state). We then use the standard forward-backward method to find the probability, at each gene, of its being in the AT-dominant state (which we now term an AT isochore) or the mixed state (which we call a CG isochore). We find that this classification is quite insensitive to varying all of the parameters above. In particular the transition probability can be varied over orders of magnitude, because the multiplicative probabilities of a relatively small number of genes can easily force a state transition, even if its a priori probability is unrealistically small. Results are shown in Figure 4.
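The two-state Markov model described above can be sketched with a scaled forward-backward pass. This is an illustration under stated assumptions rather than the authors' implementation: the function name `forward_backward` is hypothetical, the initial state prior is taken as uniform, and each gene's AT-rich probability is treated as soft evidence, so that the likelihood of gene i in a state is p·e + (1 − p)·(1 − e), where e is that state's probability of emitting an AT-rich gene. The emission values (0.9, 0.5) and transition probability 0.001 are those quoted in the text.

```python
def forward_backward(p_at, emit_at=(0.9, 0.5), switch=0.001):
    """Posterior probability that each gene is in state 0 (AT-dominant).

    p_at[i]    probability that gene i is AT-rich
    emit_at[s] probability that state s emits an AT-rich gene
    switch     state transition probability between consecutive genes
    """
    n = len(p_at)
    if n == 0:
        return []
    trans = [[1.0 - switch, switch], [switch, 1.0 - switch]]
    # Soft-evidence likelihood of each gene under each state (an assumption).
    like = [[p * e + (1.0 - p) * (1.0 - e) for e in emit_at] for p in p_at]

    def normalize(v):
        total = sum(v)
        return [x / total for x in v]

    # Forward pass, rescaled at every gene to avoid numerical underflow.
    fwd = [normalize([0.5 * like[0][s] for s in range(2)])]
    for i in range(1, n):
        prev = fwd[-1]
        fwd.append(normalize([
            like[i][s] * sum(prev[r] * trans[r][s] for r in range(2))
            for s in range(2)
        ]))

    # Backward pass, rescaled in the same way.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = normalize([
            sum(trans[s][r] * like[i + 1][r] * bwd[i + 1][r] for r in range(2))
            for s in range(2)
        ])

    # The posterior is proportional to the product of the two passes.
    return [f[0] * b[0] / (f[0] * b[0] + f[1] * b[1]) for f, b in zip(fwd, bwd)]


# Toy usage: 40 AT-rich genes followed by 40 genes alternating AT-rich and CG-rich.
posterior = forward_backward([0.95] * 40 + [0.95, 0.05] * 20)
```

In this toy input a run of a few tens of genes is enough to outweigh the 0.001 transition penalty, which illustrates why the classification is insensitive to the exact transition probability.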

Comparing Figures 1 and 4, one sees that the above Markov model largely captures one’s visual impressions of large scale structure, but now objectively (at least up to choice of model parameters). We can also validate the gene-based model by comparing it to a similar Markov model that uses raw 300 kb window counts instead of genes, shown as the red line in Figure 4. In this model, we assign a 300 kb window to the high state if its A+T fraction exceeds 0.565, a not untypical value in the isochore literature (Pavlicek et al. 2002). An AT isochore is taken to have high or low windows with respective probabilities (0.75, 0.25). A CG isochore has (0.5, 0.5), again reflecting its relatively larger variances. The transition probability is 0.001, as before. The results of this model are insensitive to the adopted parameters. Our gene-based and window-based models for isochore identification agree in 93% of all locations in the human genome.

[Figure 4 image. Panels: human chr 10, human chr 11, human chr 12. Axes: probability of large-scale AT isochore (0 to 1) versus position (Mb).]

Figure 4: Green line: Isochore boundaries obtained by applying a Markov model with two states: “AT-rich genes” and “mixed genes”. Red line: Isochore boundaries obtained by a similar model using raw counts in 300 kb windows.

In characterizing variations on large, megabase scales, we necessarily miss smaller scale features, predominantly AT-rich intrusions into CG isochores. These show up as an increase in the observed variance. It is a matter of semantics whether or not to regard these features as small isochores (Cohen et al. 2005).

Assessing Late vs. Early Gene Populations by GO Score: In previous work using Gene Ontology (GO) keyword counts, we characterized results by their statistical significance (t- and p-values). Here, we will want something more like a linear scale, so that a mixture of two populations will have a score that lies appropriately between the scores for the populations individually.

Using results from Robins and Press (2005), we define a set of “early” indicator words as the following: nucleic-acid, nucleus, transition-metal, zinc, bound, ZNF*, RNA, mRNA, DNA, nucleobase, nucleoside, translation. We define a set of “late” indicator words as: signal-transduction, signaling cascade, receptor, transducer, communication, signal, transmembrane, channel, immune, pore. Let $N_E$ be the total number of occurrences of early words across the genome (e.g., in RefSeq genes), and $N_L$ be the corresponding number for late words. Define $r_{EL} \equiv N_E/N_L$. (For the RefSeq genes we have $N_E = 31406$, $N_L = 16585$, and $r_{EL} = 1.89$.)

Now suppose that we have a large, probabilistically known, set of genes, meaning that we can assign a probability $p_i$ of gene $i$’s being in the set, and $\sum_i p_i \gg 10^2$ (say). Then we define that set’s “Late Minus Early Score” (LMES) by

$$\mathrm{LMES} \equiv \frac{r_{EL} \sum_i \sum_{j \in L} p_i \delta_{ij} - \sum_i \sum_{j \in E} p_i \delta_{ij}}{r_{EL} \sum_i \sum_{j \in L} p_i \delta_{ij} + \sum_i \sum_{j \in E} p_i \delta_{ij}} \tag{1}$$


Here $L$ is the set of late words, $E$ is the set of early words, and $\delta_{ij}$ is 1 if word $j$ occurs for gene $i$, zero otherwise. By construction, LMES of the whole genome is zero. It is 1 for a set of genes that have no early words, and −1 for a set of genes that have no late words.

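Usefully, we can also estimate the error for the LMES:

$$\sigma(\mathrm{LMES}) \approx \frac{\sqrt{r_{EL}^2 \sum_i \sum_{j \in L} p_i^2 \delta_{ij} + \sum_i \sum_{j \in E} p_i^2 \delta_{ij}}}{r_{EL} \sum_i \sum_{j \in L} p_i \delta_{ij} + \sum_i \sum_{j \in E} p_i \delta_{ij}} \tag{2}$$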

The approximation made is to ignore the error in the denominator of equation (1) as compared to that of the numerator. This is because (with foresight) it will turn out that the LMES score is never larger than a few tenths.

Equation (2) allows us to test whether different sets of genes have statistically significant differences in LMES.
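Equations (1) and (2) translate directly into code. The following is a minimal sketch, assuming each gene is represented as a pair of its membership probability and a set of GO words; that data layout and the function name `lmes` are illustrative choices made here, not part of the original method description.

```python
import math


def lmes(genes, late_words, early_words, r_el):
    """Return (LMES, sigma) following equations (1) and (2).

    genes: iterable of (p_i, words_i), with words_i a set of GO words
           attached to gene i (hypothetical data layout).
    r_el:  the genome-wide ratio N_E / N_L.
    Raises ZeroDivisionError if no indicator word occurs in the set.
    """
    late = early = late_sq = early_sq = 0.0
    for p, words in genes:
        n_late = len(late_words & words)    # sum over j in L of delta_ij
        n_early = len(early_words & words)  # sum over j in E of delta_ij
        late += p * n_late
        early += p * n_early
        late_sq += p * p * n_late
        early_sq += p * p * n_early
    denom = r_el * late + early
    score = (r_el * late - early) / denom
    sigma = math.sqrt(r_el ** 2 * late_sq + early_sq) / denom
    return score, sigma


# Toy usage: a three-gene "genome" with N_E = 3 and N_L = 2, so r_EL = 1.5.
late = {"receptor", "signal"}
early = {"zinc", "RNA", "nucleus"}
toy_genes = [(1.0, {"receptor"}), (1.0, {"zinc", "RNA"}), (1.0, {"nucleus", "signal"})]
toy_score, toy_sigma = lmes(toy_genes, late, early, r_el=1.5)
```

Because the toy set is the whole toy genome, its score vanishes by construction, mirroring the property stated after equation (1).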

Determining Whether Genes Lead or Lag a Compositional Trend: As discussed above, it is important to have an objective measure of whether genes lead or lag trends in C + G or A + T. One measure of this tendency is to compare A + T at a gene’s location with A + T at the midpoint of the intergenic region between the gene and its next neighbor. Referring to Figure 5, if genes lead the trend (as shown in panel A) we should get a different correlation between gene and intergene than if genes lag the trend (as shown in panel B). Panel D shows the two cases schematically.

A difference between the variance of genes and that of intergenes due to any other effect can confound the proposed measurement. For example, if genes had a smaller variance in their A + T composition for functional reasons, this would bias the measurement toward panel B. Or, if the measurement accuracy of A + T were poorer for genes (due to a smaller counting length) than for intergenes, then panel A would be erroneously favored. To mitigate these kinds of systematic errors, we adopt the strategy shown in Figure 5, Panel C: We characterize a gene’s A + T exclusively by its introns, which should have the least functional constraints; and we make intergenic counts with exactly the same window pattern as the gene to which they are being compared. If there are residual systematic biases in the introns (which do contain some functionality), we expect them to show up as a systematic shift in A + T, not a change in the variance. (In fact, below, we will see such small shifts.) The signature of leading genes is a positive correlation between gene and gene-minus-intergene. The signature of lagging genes is a negative correlation.
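The lead-or-lag signature reduces to the sign of one correlation coefficient. The sketch below assumes the intronic and congruent-window intergenic A+T fractions have already been measured for each gene; the function names and the use of a plain Pearson coefficient are illustrative assumptions, since the text does not specify the estimator.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lead_lag_correlation(gene_at, intergene_at):
    """Correlate gene A+T with gene-minus-intergene A+T.

    Positive values are the signature of genes leading the trend,
    negative values of genes lagging it.
    """
    diff = [g - s for g, s in zip(gene_at, intergene_at)]
    return pearson(gene_at, diff)


# Toy usage: genes twice as far from the mean as their surround (leading),
# and the reverse case (lagging).
trend = [-0.10, -0.05, 0.0, 0.05, 0.10]
leading = lead_lag_correlation([0.5 + 2 * t for t in trend], [0.5 + t for t in trend])
lagging = lead_lag_correlation([0.5 + t for t in trend], [0.5 + 2 * t for t in trend])
```

In the toy data the leading case gives a strongly positive coefficient and the lagging case a strongly negative one, matching the two schematic outcomes of Figure 5, panel D.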

[Figure 5 image. Panel A: genes lead trend (AT rich to CG rich). Panel B: genes lag trend (AT rich to CG rich). Panel C: count in intron; count in congruent intergene window. Panel D: expected plots of gene (intron) A+T versus gene minus intergene for cases (A) and (B).]

Figure 5: Strategy for measuring whether genes lead (A) or lag (B) a trend. The gene’s composition is measured by counts in its introns only (C). Counts in the adjacent intergenic region are made with an identical window function. We expect (D) a positive correlation between gene and gene-minus-intergene if genes lead the trend, a negative correlation if they lag.

Correlations and Structure Functions: We will want, below, to characterize the long-range correlational properties of A+T in the genome, noting that, because of isochores, there are significant fluctuations on the largest scales. A conventional measure, the covariance on scale $L$ of a quantity $s(x)$, is defined by

$$C(L) \equiv \langle s(x)\, s(x + L) \rangle \tag{3}$$

where angle brackets denote the population average. However, this quantity is not invariant under adding a constant to $s(x)$. If, as is generally the case, we want to characterize the fluctuations of $s$ with respect to its own mean, and not some predetermined zero point, then equation (3) demands a knowledge of the mean that may be practically unobtainable with finite data. In such circumstances, it is better to use the first-order structure function $V(L)$, defined by

$$V(L) \equiv \tfrac{1}{2} \left\langle [s(x + L) - s(x)]^2 \right\rangle = \langle s^2 \rangle - C(L) \tag{4}$$

where the formal relation to the covariance is indicated, namely a change of sign and an additive constant. It can be shown (Rybicki and Press 1992) that all properties of an unbiased correlational model (roughly, a model with no a priori preferred zero point) depend only on $V(L)$, and not on $\langle s^2 \rangle$ or $C(L)$ separately.

Table 1: RefSeq Genes by Gene and Isochore AT- or CG-richness

                   Gene type
Isochore type      AT       CG        Total
iAT                28%      ≤ 7%¹     35%
iCG                19%      46%       65%
Total              47%      53%       100%

¹ iAT/CG genes are likely to be overcounted (see text).
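The first-order structure function of equation (4) can be estimated from an evenly sampled series, for example A+T fractions in consecutive windows. This is a minimal sketch: the function name `structure_function` and the simple average over all available pairs are assumptions made for illustration, with the lag expressed in units of the sampling interval.

```python
def structure_function(s, lag):
    """Estimate V(L) = 0.5 * <[s(x + L) - s(x)]^2> at an integer lag.

    s is an evenly sampled sequence; the average runs over all pairs
    of samples separated by `lag` positions.
    """
    if not 0 < lag < len(s):
        raise ValueError("lag must satisfy 0 < lag < len(s)")
    sq_diffs = [(s[i + lag] - s[i]) ** 2 for i in range(len(s) - lag)]
    return 0.5 * sum(sq_diffs) / len(sq_diffs)


# Toy usage: the estimate is unchanged by adding a constant to the series.
series = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
v_original = structure_function(series, 1)
v_shifted = structure_function([x + 10.0 for x in series], 1)
```

Only differences of `s` enter the estimate, so no knowledge of the mean is required, which is the practical advantage over the covariance of equation (3).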

RESULTS

Characterizing the Three Gene Groups: With the above methods, we can assign to each gene a probability of being in the AT-rich (versus CG-rich) population, and, separately, a probability of being in an AT (versus CG) isochore. The results are shown in Table 1. We adopt the notation iAT and iCG as denoting isochores, AT and CG as denoting genes, so that the three principal populations are iAT/AT, iCG/CG, and iCG/AT. Although there are undoubtedly some genuine iAT/CG genes, many or most genes that we classify as iAT/CG are probably the result of misidentified isochore boundaries. Therefore we will often restrict our attention to the three principal groups mentioned above. It was previously known that CG-rich regions have higher gene density and smaller gene lengths (IHGSC 2001).

The A + T fraction of genes classed as iAT/AT is significantly higher than that of genes classed as iCG/AT, 53.6% versus 46.0% (3rd codon position counts). Part of this difference is likely due to false positives from the larger number of iCG/CG genes, since the AT-rich and CG-rich gene populations are overlapping distributions. For iCG/CG genes, the A + T fraction is 30.1%.

GO Signature Is Strong in CG Isochores, Weak or Absent in AT Isochores: The LMES score was defined above to be zero over the average gene population, positive for groups of genes with “late” GO keywords (like “signal transduction”) and negative for groups of genes with “early” GO keywords (like “nucleic acid”). Scores, and uncertainties, for the four gene groups are as follows: 0.102 ± 0.006 for iCG/CG; −0.239 ± 0.009 for iCG/AT; −0.010 ± 0.009 for iAT/AT; and 0.019 ± 0.018 for iAT/CG (the larger uncertainty from the smaller population size).


What is remarkable is that the largest positive and negative scores, by far, are for genes in CG isochores, while genes in AT isochores have LMES scores consistent with zero. In other words, AT-rich genes in CG isochores are functionally more extreme (“early”) than AT-rich genes in AT isochores, even as their nearby neighbors on the genome, the iCG/CG genes, tend strongly to “late” functions. This effect is not a correlation with AT richness (indeed, it has the opposite sign), since iCG/AT genes are markedly less AT-rich than iAT/AT genes. The observed effect is likewise opposite to what would be expected from any misclassification of iCG/CG genes as iCG/AT.

The average LMES scores for genes in CG and AT isochores are, respectively, 0.003 ± 0.006 and −0.006 ± 0.008, that is, statistically zero. It is striking that the CG isochores are so accurately zero, since that value is obtained only by averaging a large positive (iCG/CG) and even larger negative (iCG/AT) value in just the right proportions.
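As a rough consistency check, one can combine the two CG-isochore scores using the gene fractions of Table 1 as weights. This is only an approximation introduced here: LMES is a ratio of keyword counts rather than a per-gene average, so gene-count weights are an assumed stand-in for the true mixing proportions.

$$\frac{0.46\,(0.102) + 0.19\,(-0.239)}{0.46 + 0.19} = \frac{0.0469 - 0.0454}{0.65} \approx 0.002$$

Under that assumption the result lies well within the quoted value of 0.003 ± 0.006 for all genes in CG isochores.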

These data suggest that AT and CG isochores in fact contain the same mixtures of functionality (average LMES zero), but that only in CG isochores have these differences been made visible as differences in gene AT richness. This is consistent with a scenario in which genes in AT isochores never experienced the selective pressure that created the isochores, while genes in CG isochores were thus challenged, but with dramatically different (and functionally correlated) responses, varying from gene to gene.

Human Genes Lead Their Surround in Becoming CG Rich: Figure 6 shows the result of applying the lead-or-lag methodology described above (and in Figure 5) to the human genome. A significant positive correlation between gene and gene-minus-intergene counts is seen for all three gene populations, most strongly for iCG/CG genes. This indicates that all genes have some tendency to lead their flanking sequence towards (depending on the gene) CG- or AT-richness (a signal that selection pressure is acting on the genes), and that the tendency is by far strongest for CG-rich genes. As plotted, Figure 6 does not exclude repeating elements, but the results are not significantly different if we exclude either (i) all elements identified by RepeatMasker, or (ii) only the most common LINE and SINE elements.

Zebrafish Genes Also Show AT or CG Preferences: We can perform an identical analysis on the D. rerio genome, which lacks any visible isochores, with results shown in Figure 7. Unexpectedly, we again find a clear signature of genes leading their surround in relative AT- and CG-richness. Different from human, however, the effect in fish is much more symmetric between those genes that “prefer” AT, versus CG. The magnitude of the effect in fish is considerably smaller for CG preference than in human (left-hand side in Figures 6 and 7), though it seems comparable for AT preference (right-hand side) or for genes in AT isochores (green points in Figure 7). The distribution of fish is visibly closer to being a multivariate Gaussian, as is often characteristic of a process in equilibrium. The human distribution suggests disequilibrium, either active or frozen. The D. melanogaster genome gives results similar to D. rerio, so the effect is not limited to vertebrates.

[Figure 6 plot. Horizontal axis: intronic A+T difference (gene minus surround), from −0.2 to 0.2. Vertical axes: intronic A+T (gene), from 0.2 to 0.7; number of genes. Legend: iCG/CG, iCG/AT, iAT/AT.]

Figure 6: Results of testing whether genes lead or lag. Blue, red, and green denote respectively CG genes in CG isochores (iCG/CG), AT genes in CG isochores (iCG/AT), and AT genes in AT isochores (iAT/AT). All three gene types tend to lead their surround, most strongly for iCG/CG (compare Figure 5D).

If it were not for iCG/CG genes in human (blue points in Figure 6), fish and human would look remarkably similar in the two figures. Not only do both show the same (weak) correlation between gene and gene-minus-surround, but both also share approximately the same offset in the mean of gene-minus-surround, visible in the green histograms in each figure. This offset might be explained by AT-rich functional sequences within the gene introns.

The data for human, particularly when compared to fish, directly implicate iCG/CG genes as being not merely passive passengers in the formation of CG isochores, but rather active leaders. The evolutionary pressure that created isochores, as distinct from the more balanced AT versus CG preference seen in fish, appears to be (or have been) concentrated on iCG/CG genes. For this reason we now look more closely at the molecular evolution of those genes.

[Figure 7 plot. Horizontal axis: intronic A+T difference (gene minus surround), from −0.2 to 0.2. Vertical axes: intronic A+T (gene), from 0.4 to 0.9; number of genes.]

Figure 7: Same as Figure 6 but for zebrafish. Genes again lead their surround, but within a narrower range, and more symmetrically for those favoring AT versus CG. Human AT isochores (green in Figure 6) are seen to be qualitatively similar to (all) fish genes. Note also the small offset from zero in both cases, which may be explained by AT-rich functional elements in introns.

Isochore Evolution Drove Significant Changes in Protein Amino Acid Usage: It is a truism that functionality in the genome is strongly concentrated in genes. If genes adapt to an evolutionary pressure by becoming more CG rich, for example, then there should be an associated positive change in fitness. We can gauge the size of such a fitness advantage, qualitatively at least, by looking at what the gene is willing to trade for it. Mutations to C and G in introns or untranslated regions, except for small protected functional sequences such as splice sites or transcription factor binding sites, should be relatively neutral. Synonymous mutations in 3rd codon positions may imply a small fitness cost, since codon usage can affect translation rates (Levy et al. 1996; Zolotukhin et al. 1996; Wells et al. 1999). Non-synonymous codon mutations that change the protein produced are of a wholly different order, since they affect the posttranslational function of (e.g.) an enzyme or transcription factor. These mutations should imply, on average, quite a significant fitness cost if they go beyond the very restricted set of amino acid substitutions that are considered (nearly) neutral.

We have examined the aligned sequences of all human iCG/CG genes and their known zebrafish orthologs, and counted the frequency with which amino acids are substituted. The resulting 20 × 20 table of counts may be looked at from two viewpoints: From a biochemical perspective, we may ask whether the substitution patterns “make sense” in favoring substitutions that are close in chemical property. Or, from a genomic perspective, we may ask whether the substitutions seem driven by an imperative to increase C + G, despite an apparent biochemical cost in fitness.
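The counting step can be sketched as follows. This is an illustrative example only: the alignment procedure is not specified in this section, so the sketch assumes pre-aligned, equal-length protein strings with "-" marking gaps, and the names `substitution_counts` and `usage_change_percent` are hypothetical. The usage change is computed here over aligned positions only, which may differ from how the percentages in the text were obtained.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def substitution_counts(aligned_pairs):
    """Count (fish residue, human residue) pairs over aligned positions.

    aligned_pairs: iterable of (fish_seq, human_seq) strings of equal length.
    Positions with a gap or non-standard symbol in either sequence are skipped.
    The result is a sparse form of the 20 x 20 table of counts.
    """
    counts = Counter()
    for fish, human in aligned_pairs:
        if len(fish) != len(human):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(fish.upper(), human.upper()):
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                counts[(a, b)] += 1
    return counts

def usage_change_percent(counts, aa):
    """Percent change in usage of amino acid `aa` from fish to human."""
    fish_total = sum(n for (a, _), n in counts.items() if a == aa)
    human_total = sum(n for (_, b), n in counts.items() if b == aa)
    if fish_total == 0:
        raise ValueError("amino acid absent from fish sequences")
    return 100.0 * (human_total - fish_total) / fish_total

# Invented toy alignment, fish first and human second.
pairs = [("MKSL-", "MRPLA"), ("SK", "PK")]
table = substitution_counts(pairs)
print(table[("S", "P")], usage_change_percent(table, "K"))
```

Diagonal entries such as ("M", "M") count conserved positions; the off-diagonal entries are the substitutions of interest.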

A first observation is that amino acid usage differs very significantly between human and fish coding regions. The difference is greatest between iCG/CG human genes and their fish orthologs, and less for iAT/AT and iCG/AT. For example (Table 2), for iCG/CG genes, proline, alanine, and glycine usage is respectively 20%, 19%, and 13% higher in human than in fish, while asparagine, isoleucine, and methionine usage is respectively 21%, 18%, and 17% lower. One notices immediately that the former have exclusively C or G in the first and second codon positions, while the latter have A and T.

[Figure 8 diagram: amino acids (one-letter codes) arranged by property, with regions labeled polar, hydrophobic, small, tiny, charged (neg., pos.), aliphatic, aromatic, and proline.]

Figure 8: Principal amino acid changes from fish to human orthologs, for CG-rich human genes in CG isochores. Green denotes fractionally most decreasing, red most increasing, amino acids. For these, the four most frequent substitutions are shown. The direction of all arrows is from fish to human. The observed trends make little sense biochemically, but can all be explained by a strong preference for amino acids with CG-rich codons. (Underlying diagram after Betts and Russell 2003.)

Table 2: Amino acid usage changes, fish orthologs to human CG-rich genes

| Amino acid | % change | Relevant codons | Largest four “came from” (when +%) or “went to” (when −%), with codon change |
|---|---|---|---|
| Pro | +20 | ccn | Ser (t→c)cn; Leu c(t→c)n; Ala (g→c)cn; Thr (a→c)cn |
| Ala | +19 | gcn | Ser (t→g)cn; Thr (a→g)cn; Val g(t→c)n; Gly g(g→c)n |
| Gly | +13 | ggn | Ser (a→g)gn; Ala g(c→g)n; Glu g(a→g)n; Arg (a→g)gn |
| Arg | +6 | cgn, agn | Lys a(a→g)n; Gln c(a→g)n; Ser (tc→cg)n; Leu c(t→g)n |
| · · · | | | |
| Lys | −13 | aan | Arg a(a→g)n; Gln (a→c)an; Glu (a→g)an; Ser complex |
| Met | −17 | atg | Leu (a→c)tg; Val (a→g)tg; Ile at(g→n); Thr a(t→c)g |
| Ile | −18 | atn | Val (a→g)tn; Leu (a→c)tn; Thr a(t→c)n; Ala (at→gc)n |
| Asn | −21 | aan | Ser a(a→g)n; Asp (a→g)an; Gly (aa→gg)n; Thr a(a→c)n |

All codon changes increase C + G except five in italics, which are neutral.


Figure 8 shows a biochemical perspective. For each of the eight amino acids with the greatest positive or negative changes in usage between fish and human, arrows are shown indicating the four most frequent substitutions. The underlying diagram, after Betts and Russell (2003), puts closely substitutable amino acids close to each other. One sees that only a few of the most frequent observed substitutions make biochemical sense, e.g., Lys (K) to Arg (R), or Ile (I) to Leu (L), while many others are, figuratively and literally, a stretch: Lys (K) to Ser (S), Leu (L) to Arg (R), Leu (L) to Pro (P), Thr (T) to Pro (P), etc. It is hard to imagine that there are not significant functional consequences in making these kinds of substitutions in ≈ 10% of particular amino acids. Conversely, it would be interesting and unexpected if such large numbers of amino acid positions in proteins are so broadly substitutable.

If the data in Figure 8 don’t make sense biochemically, they do make sense when mapped into the genetic code. As shown in Table 2, of the 32 largest substitutions, 27 can be explained as single mutations in the first or second codon position that change an A or T into a C or G. The remaining five are all CG-neutral. None of the 32 are codon changes favoring A or T.
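The bookkeeping behind this statement is simple enough to sketch. The helper names below (`cg_count`, `cg_change`) are hypothetical, and the example codons fix the unspecified third position "n" of Table 2 to "a" purely for illustration; the third position is unchanged in each of these examples, so the choice does not affect the result.

```python
def cg_count(codon):
    """Number of C or G bases in a three-letter codon."""
    codon = codon.upper()
    if len(codon) != 3 or any(b not in "ACGT" for b in codon):
        raise ValueError("codon must be three letters from A, C, G, T")
    return sum(b in "CG" for b in codon)

def cg_change(fish_codon, human_codon):
    """Net change in C+G content going from the fish to the human codon.

    Positive: the substitution increases C+G; zero: CG-neutral;
    negative: the substitution favors A or T.
    """
    return cg_count(human_codon) - cg_count(fish_codon)

# Examples patterned on Table 2 (fish codon, human codon), with n set to "a".
examples = {
    "Ser->Pro (t->c)cn": ("tca", "cca"),
    "Leu->Pro c(t->c)n": ("cta", "cca"),
    "Gly->Ala g(g->c)n": ("gga", "gca"),
    "Ile->Ala (at->gc)n": ("ata", "gca"),
}
for label, (fish, human) in examples.items():
    print(label, cg_change(fish, human))
```

A substitution such as Gly to Ala, g(g→c)n, swaps one strong base for another and is therefore CG-neutral under this count.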

The data are thus indicative of strong natural selection towards CG richness, affecting not only base usage in 3rd codon positions, 3′ and 5′ UTRs, and intronic regions, but also directly assaulting a gene’s functionality by causing large numbers of seemingly otherwise inexplicable amino acid substitutions.

Correlation Lengths Are Large in Human, Small in Fish: As characterized thus far, the posited fitness pressure towards CG richness would not produce isochores, but only regions of CG richness surrounding each iCG/CG gene; they would be “microisochores”. In other words, we have shown that genes lead their immediately bordering intergenic region, but said nothing about the spatial scale of their influence.

Figure 9 shows the pattern of spatial correlation, characterized by the structure function V(L), equation (4), for human and fish. By definition, all curves shown go to zero at L = 0. Given the visibility of human isochores, it is not surprising that the structure function is still rising (showing residual correlation) out to ≈ 10 Mb. In fish, by contrast, the structure function is constant (no correlation), except for noise fluctuations, on any scale probed, that is, any scale larger than ≈ 100 kb.
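For readers unfamiliar with structure functions, a minimal sketch follows. Equation (4) is not reproduced in this section, so the sketch assumes the common definition V(L) = ½⟨[f(x + L) − f(x)]²⟩ applied to evenly spaced composition values; the paper's estimator, which works with irregularly placed genes, may differ in detail. The function name `structure_function` is hypothetical.

```python
def structure_function(values, lag):
    """Estimate V(lag) = 0.5 * mean((f[i + lag] - f[i])**2).

    values: composition (e.g. A+T fraction) in consecutive, evenly spaced windows.
    lag:    separation in units of windows (assumed definition; see lead-in).
    V(0) is zero by construction; for uncorrelated values V approaches the
    variance at every nonzero lag, while correlated values make V rise with lag.
    """
    if lag < 0 or lag >= len(values):
        raise ValueError("lag must satisfy 0 <= lag < len(values)")
    sq_diffs = [(values[i + lag] - values[i]) ** 2
                for i in range(len(values) - lag)]
    return 0.5 * sum(sq_diffs) / len(sq_diffs)

# Invented toy series: a slow ramp, so differences grow with separation.
ramp = [0.40, 0.42, 0.44, 0.46, 0.48, 0.50]
print([round(structure_function(ramp, k), 6) for k in range(4)])
```

Under this definition a curve that keeps rising with L, as for human in Figure 9, signals correlation persisting out to that scale, whereas a flat curve indicates none.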

From correlation alone, one cannot impute causality. If, however, the evidence of previous sections is taken as showing that isochores were adaptations to a natural selection acting on genes, then a causal interpretation of Figure 9 is that the selective pressure on a human gene is (or once was) influenced by the nucleotide composition of its flanking sequences out to many megabases distance. No such effect is seen in fish.

[Figure 9 plot. Horizontal axis: length L (Mb), from 0 to 10. Vertical axis: structure function 2V(L). Curves: gene - gene, gene - intergene, intergene - intergene, gene - gene (fish).]

Figure 9: Correlation of AT richness with scale, as characterized by the structure function V(L), equation (4). In human, gene-gene correlation is readily apparent (structure function increasing) out to ≈ 10 Mb. Gene-gene correlation is stronger than gene-intergene or intergene-intergene. Fish has a smaller overall variance, but with no discernible correlation on any scale larger than ≈ 100 kb.

The correlation of gene with gene is observed to be larger than the correlation of intergene with intergene (intergene being defined as a region bracketing the midpoint between every two consecutive genes), with the gene-intergene correlation being intermediate. If a causal interpretation is correct, then this can be taken as evidence that multiple genes are affected by a single (distant) flanking region, and that genes are able to adapt their preferences to the large-scale C + G environment. Specifically, if nearby genes can co-evolve to become mutually tolerant of high C + G, then the overall fitness of the organism is increased. In more general terms, the larger gene-gene correlation is additional evidence that genes played a lead role in isochore formation.

DISCUSSION

If isochores were the result of the accumulation of neutral changes, and not adaptation to selection, then they might be explained by merely a mechanism that favors C and G, such as mutation bias or biased gene conversion (BGC). However, if isochores are a product of natural selection, as the results of this paper strongly suggest, then the mechanism (which could still be BGC, or merely ordinary single base substitution) is less important than the puzzling question: what was the selection pressure? To be viable, an explanatory theory must be consistent with all of the following observations, from this work and the previous literature:

(i) Role of genes. Gene locations are special in isochores. Genes, both AT-rich and CG-rich, have more extreme compositions than their immediate intergenic surround. A theory must explain Figure 6.

(ii) Strength. The positive selection pressure for CG richness is strong enough to overcome the negative selection effects of large numbers of non-synonymous codon changes, resulting in many amino acid substitutions that make little a priori biochemical sense.

(iii) Large scale correlation. The isochore selection pressure must be focused on genes, but it must also be correlated over tens of megabases (Figure 9).

(iv) Composition asymmetry. CG isochores contain many AT-rich genes, while AT isochores contain few CG-rich genes. Not unrelated, CG isochores have a larger compositional variance on all scales than do AT isochores. Fish genes have AT or CG preferences (Figure 7), but they are symmetric. Human gene preferences (Figure 6) are (or were) asymmetric.

(v) Gene functional correlations. Genes in CG isochores show a correlation between AT-richness and GO function. Genes in AT isochores don’t show such a correlation. On average, however, genes in the two isochores appear to have the same mixture of GO functions.

(vi) Spatial broken symmetry. How was any specific large region selected to become a CG isochore, or not so selected?

(vii) Gene-gene correlation. The composition of nearby genes is more strongly correlated than the composition of nearby intergenic regions.

We are far from having a theory that can explain these regularities. In its absence, however, we can construct a hypothetical story, with likely exaggerated assumptions, that shows how some of the regularities may be related to others.

We start with a genome in a natural state of relative AT richness, as is seen in almost all animals except warm-blooded vertebrates. We suppose that a population of genes, about half, depend critically on regulatory or other functional mechanisms that depend on this “universal” AT richness. An example of such a mechanism may be regulation by microRNAs (Robins and Press 2005). Genes that do, or don’t, require AT richness are randomly distributed on the genome.

Something now happens that dramatically increases the fitness of genes that are in large genomic regions of CG richness – but only those genes that are not dependent on AT-rich machinery. The degree to which a gene is coupled to its environment by selection pressure falls off only slowly with distance. In the context of a Wright-Fisher model (Ewens 2004) we might imagine that each gene j contributes an additive fitness to a proposed mutation at nucleotide position i, so that the total fitness for such a mutation is

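$$
S_i = -S_0 + \sum_{\text{genes } j} |x_i - x_j|^{-\alpha} \times
\begin{cases}
+\varepsilon & \text{for mutation to C or G} \\
-\varepsilon & \text{for mutation to A or T,}
\end{cases}
\tag{5}
$$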

where α is an exponent in the range of 0.5 ± 0.1 and ε is a small incremental fitness constant. The constant −S_0 represents some average preference for the status quo of high AT. Exponents near 0.5 are special in this idealized model: they give fitness changes S_i that are neither dominated by nearby genes nor by distant genes. The quantity S_i therefore has variations on all scales, including large regions where it is negative (which remain AT isochores) or positive (which become CG isochores).
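To make equation (5) concrete, the following sketch evaluates the fitness of a proposed mutation on a one-dimensional genome. It is a toy evaluation under stated assumptions, not a simulation of the Wright-Fisher dynamics: the function name `mutation_fitness`, the parameter values, and the `min_dist` cutoff (introduced here only to avoid the singularity at zero distance) are all hypothetical, and the gene list is taken to contain only the genes that contribute a CG preference.

```python
def mutation_fitness(x_i, gene_positions, to_cg, s0, alpha=0.5, eps=1e-3,
                     min_dist=1.0):
    """Evaluate S_i of equation (5) for a proposed mutation at position x_i.

    gene_positions: positions x_j of the contributing genes (same units as x_i).
    to_cg:          True for a mutation to C or G, False for one to A or T.
    s0:             the constant S_0 (status-quo preference for high AT).
    min_dist:       assumed short-distance cutoff, not part of equation (5).
    """
    sign = 1.0 if to_cg else -1.0
    tail_sum = 0.0
    for x_j in gene_positions:
        dist = max(abs(x_i - x_j), min_dist)
        tail_sum += dist ** (-alpha)  # power-law tail around each gene
    return -s0 + sign * eps * tail_sum

# Invented example: one gene at distance 4 from the mutated site.
print(mutation_fitness(0.0, [4.0], to_cg=True, s0=0.0, eps=0.1))   # 0.1 * 4**-0.5
print(mutation_fitness(0.0, [4.0], to_cg=False, s0=0.0, eps=0.1))
```

With many randomly placed genes, scanning x_i along the genome and recording the sign of S_i for mutations to C or G would mark out the regions that, in this story, remain AT isochores or become CG isochores.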

This kind of long-tailed fitness model may provide a unified explanation for properties (i), (iii), (iv), (v), and (vi) above. The long fitness tails around genes are constructed to explain (i) and (iii). They explain (vi) by amplifying the random large-scale fluctuations inherent in the random placement of genes. Properties (iv) and (v) are explained by the fact that mutation away from a relatively constant background of AT richness is suppressed in AT isochores (where S_i is negative); but in CG isochores, genes that require AT richness are often able to dominate small islands around themselves.

Property (vii) is not explained by this story, or any story that places genes randomly and then holds their AT- or CG-preference fixed. However, it is easy to imagine extending the model so as to allow (some) genes gradually to change their preferences. Such changes will be selected for by properties of the model already given.

Only property (ii), the strength of the pressure, remains completely mysterious.

One can imagine several biological mechanisms that might give rise to long-tailed models, though we do not consider any of them compelling. Any protein-complex machine that favors CG richness and that moves linearly along the genome, as in transcription or replication, will be more efficient if it falls off, or needs reconstitution, less often. In such a case, the long-distance tail may reflect the probability that an assembled protein machine is able to go that distance. Alternatively, one might obtain a long tail from the geometry of euchromatin packing in the nucleus. The long tail might here reflect the probability of being packed within some shorter three-dimensional radius of influence. A third possibility is that gene-denser CG-rich regions require (or allow) different mechanisms for nucleosome unpacking, and that these mechanisms are coherent over long distances.

Some completely different kinds of models would seem to be ruled out by the properties listed above. For example, an otherwise attractive explanation of (iii) and (vi) is that isochores start from rare seeds and then extend themselves at their boundaries. Some versions of BGC and recombination hotspot (IHMC 2005) models may invoke this mechanism. However, it seems hard to reconcile such models with properties (i) and (vii): If genes attract hotspots, then why should isochores extend only from existing boundaries? If genes repel hotspots (e.g., by negative selection on modifying their functional elements), then how are (i) and (vii) to be explained?

Acknowledgments: The authors thank Arnold Levine, Gerald Joyce, and Alistair McGregor. This work was supported in part by the Shelby White and Leon Levy Initiatives Fund.

References

Belle, E., L. Duret, N. Galtier and A. Eyre-Walker, 2004 The Decline of Isochores in Mammals: An Assessment of the GC Content Variation Along the Mammalian Phylogeny. J Mol Evol 58: 653–660.

Belle, E., N. Smith and A. Eyre-Walker, 2002 Analysis of the Phylogenetic Distribution of Isochores in Vertebrates and a Test of the Thermal Stability Hypothesis. J Mol Evol 55: 356–363.

Bernardi, G., 2000 Isochores and the evolutionary genomics of vertebrates. Gene 241: 3–17.

Bernardi, G., B. Olofsson, J. Filipski, M. Zerial, J. Salinas et al., 1985 The mosaic genome of warm-blooded vertebrates. Science 228: 953–958.

Betts, M. J. and R. B. Russell, 2003 Amino Acid Properties and Consequences of Substitution. In M. R. Barnes and I. C. Gray, editors, Bioinformatics for Geneticists, Wiley, New York, 289–316.

Clay, O., S. Caccio, Z. Zoubak, D. Mouchiroud and G. Bernardi, 1996 Human coding and noncoding DNA: compositional correlations. Mol Phyl Evol 5: 2–12.

Cohen, N., T. Dagan, L. Stone and D. Graur, 2005 GC composition of the human genome: in search of isochores. Mol Biol Evol 22: 1260–1272.


Duret, L., M. Semon, G. Piganeau and D. Mouchiroud, 2002 Vanishing GC-rich isochores in mammalian genomes. Genetics 162: 1837–1847.

Ewens, W. J., 2004 Mathematical Population Genetics I. Theoretical Introduction. Springer, New York.

Eyre-Walker, A. and L. D. Hurst, 2001 The evolution of isochores. Nat Rev Gen 2: 549–555.

Hamada, K., H. Tokumasa, H. Ota, K. Mizuno and T. Shinozawa, 2003 Presence of isochore structure in reptile genomes suggested by the relationship between GC contents of intron regions and those of coding regions. Genes Genet. Syst. 78: 195–198.

Hughes, S., D. Zelus and D. Mouchiroud, 1999 Warm-blooded isochore structure in Nile crocodile and turtle. Mol Biol Evol 16: 1521–1527.

IHGSC, 2001 International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 409: 860–921.

IHMC, 2005 International HapMap Consortium: A haplotype map of the human genome. Nature 437: 1299–1320.

Levy, J., R. R. M. S. Zolotukhin and C. J. Link, 1996 Retroviral transfer and expression of a humanized, red-shifted green fluorescent protein gene into human tumor cells. Nat. Biotechnol. 14: 610–614.

Pavlicek, A., J. Paces, O. Clay and G. Bernardi, 2002 A compact view of isochores in the draft human genome sequence. FEBS Letters 511: 165–169.

Robins, H. and W. H. Press, 2005 Human microRNAs target a functionally distinct population of genes with AT-rich 3′ UTRs. PNAS 102: 15557–15562.

Rybicki, G. B. and W. H. Press, 1992 Interpolation, Realization, and Reconstruction of Noisy, Irregularly Sampled Data. Astrophys J 398: 169–176.

Smith, N. and A. Eyre-Walker, 2001 Synonymous Codon Bias Is Not Caused By Mutation Bias in G+C-Rich Genes in Humans. Mol Biol Evol 18: 982–986.

Vinogradov, A. E., 2003 Isochores and tissue-specificity. Nucleic Acids Res 31: 5212–5220.


Wells, K., J. Foster, K. Moore, V. Pursel and R. Wall, 1999 Codon optimization, genetic insulation, and an rtTA reporter improve performance of the tetracycline switch. Transgenic Res. 8: 371–381.

Zolotukhin, S., M. Potter, W. Hauswirth, J. Guy and N. Muzyczka, 1996 A ’humanized’ green fluorescent protein cDNA adapted for high-level expression in mammalian cells. J. Virology 70: 4646–4654.
