Please cite this article in press as: Seligmann, H., Polymerization of non-complementary RNA: Systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes. BioSystems (2013), http://dx.doi.org/10.1016/j.biosystems.2013.01.011
ARTICLE IN PRESS
G Model BIO 3360 1–19
BioSystems xxx (2013) xxx–xxx
Contents lists available at SciVerse ScienceDirect
BioSystems
journal homepage: www.elsevier.com/locate/biosystems
Polymerization of non-complementary RNA: Systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes
Hervé Seligmann a,b,∗
a National Natural History Museum Collections, The Hebrew University of Jerusalem, 91904 Jerusalem, Israel
b Department of Life Sciences, Ben Gurion University, 84105 Beer Sheva, Israel
Article info
Article history:
Received 24 October 2012
Received in revised form 24 January 2013
Accepted 29 January 2013
Keywords:
Expressed sequence tags
Nucleotide misinsertion
Human DNA polymerase gamma
Genome compression
Antitermination tRNA
Termination codon
Abstract
Usual DNA→RNA transcription exchanges T→U. Assuming different systematic symmetric nucleotide exchanges during transcription, some GenBank RNAs match exactly human mitochondrial sequences (exchange rules listed in decreasing transcript frequencies): C↔U, A↔U, A↔U+C↔G (two nucleotide pairs exchanged), G↔U, A↔G, C↔G, none for A↔C, A↔G+C↔U, and A↔C+G↔U. Most unusual transcripts involve exchanging uracil. Independent measures of rates of rare replicational enzymatic DNA nucleotide misinsertions predict frequencies of RNA transcripts systematically exchanging the corresponding misinserted nucleotides. Exchange transcripts self-hybridize less than other gene regions; self-hybridization increases with length, suggesting endoribonuclease-limited elongation. Blast detects stop codon depleted putative protein coding overlapping genes within exchange-transcribed mitochondrial genes. These align with existing GenBank proteins (mainly metazoan origins, prokaryotic and viral origins underrepresented). These GenBank proteins frequently interact with RNA/DNA, are membrane transporters, or are typical of mitochondrial metabolism. Nucleotide exchange transcript frequencies increase with overlapping gene densities and stop densities, indicating finely tuned counterbalancing regulation of expression of systematic symmetric nucleotide exchange-encrypted proteins. Such expression necessitates combined activities of suppressor tRNAs matching stops, and nucleotide exchange transcription. Two independent properties confirm predicted exchanged overlap coding genes: discrepancy of third codon nucleotide contents from replicational deamination gradients, and codon usage according to circular code predictions. Predictions from both properties converge, especially for frequent nucleotide exchange types. Nucleotide exchanging transcription apparently increases coding densities of protein coding genes without lengthening genomes, revealing unsuspected functional DNA coding potential.
© 2013 Published by Elsevier Ireland Ltd.
∗ Correspondence address: National Natural History Museum Collections, The Hebrew University of Jerusalem, 91904 Jerusalem, Israel.
E-mail address: [email protected]
1. Introduction
The question ‘why are there several stop codons?’ (Krizek and Krizek, 2012) has an apparently satisfying answer: off frame, protein coding genes include numerous stops (Seligmann and Pollock, 2004a,b; Singh and Pardasani, 2009; Tse et al., 2010) which decrease protein synthesis costs due to unprogrammed ribosomal slippage (Seligmann, 2007, 2010a; Warnecke and Hurst, 2011). In addition, the genetic code’s codon–amino acid assignments maximize off frame stop numbers (Itzkovitz and Alon, 2007), and third codon positions that are part of off frame stops tend to mutate less than comparable positions (Seligmann, 2012a). However, this explanation hides a further function that stop codons play in off frame sequences: it seems that when antitermination (suppressor) tRNAs are active in translation, the regular genetic code is de facto transformed into another, stopless genetic code (Seligmann, 2010b). Translating sequences into proteins according to that overlapping code reveals numerous previously undetected genes and proteins, their number coevolving with capacities of antitermination tRNAs (tRNAs with anticodons matching stops) to translate the stops they include (Faure et al., 2011; Seligmann, 2011a, 2012a,b). Inclusion of stop codons in the regular genetic code enables a double coding system, based on the same sequences, and whose expression is efficiently regulated by the presence or absence of suppressor (antitermination) tRNAs. That way, numbers of coded proteins can be high while keeping a relatively short genome, by switching from the regular genetic code to a stopless code.
0303-2647/$ – see front matter © 2013 Published by Elsevier Ireland Ltd. http://dx.doi.org/10.1016/j.biosystems.2013.01.011
Genome length is an important factor limiting replication and cellular multiplication rates, apparently affecting also developmental rates of metazoan organisms (Sessions and Larson, 1987; Gregory and Hebert, 1999; Chipman et al., 2001). Ample data suggest that even at the level of single amino acids, protein sequences minimize metabolic synthesis costs (Akashi and Gojobori, 2002; Seligmann, 2003; Barton et al., 2010), notably of cognate amino acids (Perlstein et al., 2007; Alves and Savageau, 2005; Seligmann, 2012b). Protein length reduction apparently follows similar principles (Brocchieri and Karlin, 2005; Warringer and Blomberg, 2006; Seligmann, 2012b). Considering this, it is very probable that similar forces decrease genome length. Accordingly, there would be a strong advantage for being able to code for more proteins, while keeping the genome short, a phenomenon that increases coding density by coding compression, such as overlapping genes, including those induced by antitermination tRNA activity (Seligmann, 2011a, 2012c, in press-a; Faure et al., 2011). Recent analyses suggest that mitochondrial genomes include several overlapping genes coded in the 3′-to-5′ direction of regular protein coding genes, apparently expressed upon putative ‘invertase’ activity, which would invert the sequence polymerized into RNA in the 3′-to-5′ direction (Seligmann, 2012d). A further mechanism apparently increasing coding density is that of protein coding genes based on tetracodons, quadruplet codons recognized by (among others) tRNAs with expanded anticodons (Seligmann, 2012e). Mitochondrial genes for ribosomal RNAs seem also to include overlapping protein coding genes (Seligmann, 2012g).
It is in this context that a group of phenomena called RNA recoding is considered here. These typically imply changing frames (Namy et al., 2005) and various phenomena of exon/intron reshuffling (e.g., Jin et al., 2007; Lev-Maor et al., 2007). In some cases, recoding alters the nucleotides used, such as adenosine-to-inosine RNA editing (Reenan, 2005; Paz et al., 2007; Daniel et al., 2011).
1.1. Nucleotide exchanges as a working hypothesis for cryptic overlapping genes
The systematic ‘recoding’ of T (thymidine) to U (uracil) in transcription from DNA to RNA is also a type of recoding, by DNA→RNA polymerases that systematically exchange T by U, and U by T for reverse transcriptases. This suggests the hypothesis that coding density might be increased by other types of systematic nucleotide exchanges, e.g. A by C and C by A (or any other symmetric exchange of this type). The fact that during regular DNA replication, ribonucleotides are frequently inserted instead of deoxynucleotides by the mitochondrial DNA polymerase gamma (Kasiviswanathan and Copeland, 2011) indicates that polymerases have some flexibility in that respect. Misinsertion of non-complementary nucleotides is also a basic property of polymerase (mis)function (Lee and Johnson, 2006). The possibility of polymerase activity implying systematic misinsertions, producing non-complementary DNA and/or RNA strands, cannot be excluded.
Such recoded RNA, based on the template of regular DNA sequence, could code for additional protein coding gene(s). Interestingly, if this occurs at DNA level, this could be a mechanism for producing new genes, but in this case the assumed mechanism of transcription exchanging between nucleotides implies that genes code according to ‘direct’ (non-exchanging) and exchange transcription. In some ways, the former can be seen as explicit, and the latter as implicit coding; nevertheless, both levels would be inherent simultaneously to the gene’s primary structure.
Hence if such nucleotide exchanging activity exists, by some kind of unknown or modified DNA→RNA polymerases during RNA polymerization or editing, inducing such activity might unleash a very large coding potential, enabling coding for proteins without increasing genome size. In addition, this system implies very simple regulation, as each set of genes associated with a given type of nucleotide exchange would be induced by the expression of its specific ‘nucleotide exchanger’ polymerase/editing activity.
Table 1
The nine different RNA sequences produced from transcription of a single DNA sequence (ACGT) according to the nine types of symmetric nucleotide exchange rules. The amino acid coded by the three first nucleotides according to the vertebrate mitochondrial genetic code is also indicated, as well as the percentage of nucleotides that remain identical after that type of exchange transcription.

Exchange rule | Initial DNA 5′-ACGT-3′ | Codon for Thr | Similarity to initial DNA sequence
A↔C           | 5′-CAGU-3′             | Gln           | 50%
A↔G           | 5′-GCAU-3′             | Ala           | 50%
A↔U           | 5′-UCGA-3′             | Ser           | 50%
C↔G           | 5′-AGCU-3′             | Ser           | 50%
C↔U           | 5′-AUGC-3′             | Met           | 50%
G↔U           | 5′-ACUG-3′             | Thr           | 50%
A↔C and G↔U   | 5′-CAUG-3′             | His           | 0%
A↔G and C↔U   | 5′-GUAC-3′             | Val           | 0%
A↔U and C↔G   | 5′-UGCA-3′             | Cys           | 0%
In total, considering only the four usual nucleotides, nine symmetric nucleotide exchanges are possible, multiplying by nine the coding potential of any single sequence. Six of these involve only two types of nucleotides (A↔C, A↔G, A↔U, C↔G, C↔U, G↔U) and three involve all four types of nucleotides, implying two symmetric exchanges (A↔C+G↔U, A↔G+C↔U, and A↔U+C↔G). Table 1 shows the different RNA sequences produced by each of these rules from a single, given initial DNA sequence. Note that this procedure alters at least 50% of the nucleotides in the initial sequence used in Table 1, and that the amino acid coded by the three first nucleotides in that sequence is changed in almost all cases after systematic symmetric nucleotide exchange.
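The sequences and similarity percentages of Table 1 can be reproduced with a few lines of code. The following Python sketch is only an illustration: the names RULES, exchange and similarity are hypothetical helpers introduced here, and T in the DNA is treated as equivalent to U in the RNA when similarity is counted.

```python
# The nine symmetric exchange rules of Table 1, written as swapped pairs.
RULES = ["AC", "AG", "AU", "CG", "CU", "GU", "AC+GU", "AG+CU", "AU+CG"]

def exchange(rna, rule):
    """Apply a symmetric exchange rule such as 'AC' or 'AU+CG' to an RNA string."""
    src, dst = "", ""
    for x, y in rule.split("+"):      # each pair is a two-letter string
        src += x + y
        dst += y + x
    return rna.translate(str.maketrans(src, dst))

def similarity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    return 100 * sum(p == q for p, q in zip(a, b)) / len(a)

dna = "ACGT"
rna = dna.replace("T", "U")           # regular transcription notation: T -> U
for rule in RULES:
    out = exchange(rna, rule)
    print(f"{rule:6s} 5'-{out}-3'  {similarity(rna, out):.0f}%")
```

Because every rule is applied through a single translation table, each nucleotide is mapped exactly once, which is what makes the exchange symmetric.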
Along the same lines, asymmetric nucleotide recodings are also possible (such as an exchange rule including three nucleotide exchanges, e.g., A→C, C→G and G→A); in total, 14 asymmetric exchange possibilities exist (including rules with four asymmetric nucleotide exchanges). For practical reasons, I explore here only symmetric exchanges. Separating symmetric from asymmetric exchanges is also justified by the possibility that symmetric and asymmetric nucleotide exchanges may depend upon different types of polymerization (or editing) mechanisms.
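The counts of nine symmetric and 14 asymmetric rules can be checked by enumeration. The sketch below assumes that an exchange rule is a one-to-one mapping of the four nucleotides onto themselves (a permutation), which is an interpretation of the text rather than a definition given in it; the function name classify is hypothetical.

```python
from itertools import permutations

NUCS = "ACGU"

def classify(perm):
    """Label a mapping of ACGU onto itself as identity, symmetric or asymmetric."""
    mapping = dict(zip(NUCS, perm))
    if all(mapping[n] == n for n in NUCS):
        return "identity"
    # Symmetric: applying the rule twice restores every nucleotide.
    if all(mapping[mapping[n]] == n for n in NUCS):
        return "symmetric"
    return "asymmetric"

counts = {"identity": 0, "symmetric": 0, "asymmetric": 0}
for perm in permutations(NUCS):
    counts[classify(perm)] += 1
print(counts)
```

Under this assumption the 24 permutations split into the identity (regular transcription), nine symmetric rules, and 14 asymmetric rules (eight involving three nucleotides and six involving four).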
First, I explore GenBank’s EST (expressed sequence tags) RNA databank for sequences matching the ‘exchanged’ human mitochondrial genome according to each of the nine symmetric exchange rules and report the results for the various types of exchanges. Then Blast alignment analyses explore whether RNA recoded by each of these exchanges could be coding for proteins, using various bioinformatics methods to indicate whether the detected putative overlapping genes seem functional or not. A meta-analysis of the data shows that frequencies of RNAs associated with the different types of symmetric exchanges are proportional to the bioinformatics estimations of overlap protein coding gene functionalities, indicating that coding compression through RNA exchange/editing occurs, and this at different frequencies for different types of nucleotide exchanges. Most notably, DNA nucleotide misinsertion rates during replication predict rates of nucleotide exchanging RNA transcription.
2. Materials and methods
2.1. Sequence manipulations and alignments with existing RNA transcripts
All analyses are done for GenBank’s reference complete human mitochondrial genome (NC_012920). Its entire sequence is copy-pasted from GenBank into a blank Microsoft Word file. In Word, the sequence of the genome is altered by using the software’s ‘Replace’ function, mimicking a putative systematic nucleotide exchange. For example, for the symmetric exchange rule A↔C, the function ‘Replace’ is used to replace all ‘A’s in the genome by ‘X’, then all ‘C’s by ‘A’, and then all ‘X’s by ‘C’. The intermediate stage using ‘X’ (or any other arbitrary symbol differing from the four letters used to symbolize the four nucleotides) is necessary to prevent ‘A’s changed into ‘C’s at the first step from being changed back into ‘A’ at the second step. The resulting sequence, in which all ‘A’s present in the initial genome are replaced by ‘C’ and all ‘C’s in the initial genome are replaced by ‘A’, is copy-pasted from Word into GenBank’s online alignment software ‘Blastn’. Blastn is then requested to search, according to standard default search parameters, its EST database for RNA sequences matching that altered sequence, that is, the human mitochondrial genome assuming systematic symmetric nucleotide exchange A↔C. This procedure combining Word and Blastn is repeated for each of the nine systematic symmetric exchange rules in Table 1.
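The three-step replacement described above can be mirrored in code. The Python sketch below is not the procedure used in this study (which relied on Word); the function names and the short test fragment are hypothetical and serve only to show why the placeholder step is needed.

```python
def swap_naive(seq, x, y):
    """Two-step replace WITHOUT placeholder: wrong, shown for comparison only."""
    return seq.replace(x, y).replace(y, x)

def swap_with_placeholder(seq, x, y, placeholder="X"):
    """Three-step replace through an intermediate symbol, as done in Word."""
    if placeholder in seq:
        raise ValueError("placeholder must not occur in the sequence")
    return seq.replace(x, placeholder).replace(y, x).replace(placeholder, y)

def swap_with_table(seq, x, y):
    """Equivalent single-pass exchange using a translation table."""
    return seq.translate(str.maketrans(x + y, y + x))

fragment = "AACCGGTTAC"   # arbitrary illustrative fragment, not genome data
print(swap_naive(fragment, "A", "C"))             # all A/C collapse to A
print(swap_with_placeholder(fragment, "A", "C"))  # A and C properly exchanged
print(swap_with_table(fragment, "A", "C"))        # same result in one pass
```

The single-pass version avoids the placeholder altogether, because each character is looked up once in the table and never revisited.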
2.2. Prediction of secondary structure
Mfold (Zuker, 2003) is used to predict secondary structure formation. This is done for the complete (exchanged) sequence of genes for which exchange transcripts are detected. Mfold’s output presents the secondary structures that are within 5% of the optimal (most stable) secondary structure. The number of secondary structures in which a site does not participate in self-hybridization is indicated by Mfold’s ‘ss-number’. This number is averaged across all nucleotides and divided by the total number of secondary structures predicted by Mfold. The result represents the average ‘loopiness’ (tendency to form loops) for that RNA sequence. It is calculated separately for regions belonging to RNA transcripts that are transcribed according to a systematic nucleotide exchange rule, and for the rest of that gene. The difference between the latter loopiness and the loopiness of the region that has been transcribed by symmetric nucleotide exchange estimates the loopiness of the exchange transcribed region as compared to the rest of the gene, assuming it were also exchange transcribed. Potentially, this subtraction can indicate whether regions that were transcribed according to nucleotide exchange rules differ in secondary structure formation propensities from other regions of that gene.
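The loopiness calculation can be sketched as follows. This is an illustrative reading of the description above, not the script used for the analyses: the per-site counts are assumed to have been extracted from Mfold's output beforehand, the function names are hypothetical, and the sign convention (exchange transcribed region minus rest of the gene) follows the caption of Fig. 1.

```python
def loopiness(ss_counts, n_structures):
    """Mean fraction of predicted structures in which a site is unpaired."""
    return sum(ss_counts) / len(ss_counts) / n_structures

def loopiness_difference(ss_counts, n_structures, start, end):
    """Loopiness of sites [start, end) minus loopiness of the rest of the gene."""
    inside = ss_counts[start:end]
    outside = ss_counts[:start] + ss_counts[end:]
    return loopiness(inside, n_structures) - loopiness(outside, n_structures)

# Hypothetical per-site counts for a 10-nt sequence with 4 predicted structures.
ss = [4, 4, 0, 0, 2, 2, 4, 4, 0, 0]
print(loopiness(ss, 4))
print(loopiness_difference(ss, 4, 4, 8))
```

A positive difference would mean that the exchange transcribed region forms loops more readily (self-hybridizes less) than the remainder of the gene.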
2.3. Candidate overlapping genes detected by Blastp alignments
In order to investigate potential protein coding by nucleotide exchange, I translated into putative protein sequences all six frames of all 13 human mitochondrial protein coding genes, after exchanging nucleotides along each of the 9 exchanging rules. Translation of exchanged RNAs was done by the online available software ‘transeq’ at the EMBL-EBI site (http://www.ebi.ac.uk/Tools/st/), according to the regular vertebrate mitochondrial genetic code, inserting asterisks (*) where stop codons occur in the exchange transcribed sequence. Hence putative proteins do not determine the identities of amino acids inserted where stop codons occur in the exchange transcribed RNA. For any single gene, a total of 6 × 9 = 54 hypothetical protein sequences were produced across frames and exchange rules, and a total of 13 × 54 = 702 hypothetical protein sequences for the 13 regular protein coding sequences in the human mitochondrial genome were examined.
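The combination of nine exchange rules and six reading frames can be illustrated with a small stand-in for transeq. The Python sketch below is an assumption-laden illustration, not the software used here: the amino acid string is NCBI translation table 2 (vertebrate mitochondrial) with codons ordered TTT, TTC, TTA, TTG, TCT, ..., sequences are written in the DNA alphabet (T for U), and the toy gene is hypothetical.

```python
BASES = "TCAG"
# Vertebrate mitochondrial code (NCBI table 2); '*' marks stop codons.
AMINO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

RULES = ["AC", "AG", "AT", "CG", "CT", "GT", "AC+GT", "AG+CT", "AT+CG"]

def exchange(seq, rule):
    """Apply a symmetric exchange rule such as 'AC' or 'AT+CG'."""
    pairs = rule.split("+")
    src = "".join(pairs)
    dst = "".join(p[::-1] for p in pairs)
    return seq.translate(str.maketrans(src, dst))

def revcomp(seq):
    """Inverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons of seq, writing '*' at stop codons."""
    return "".join(CODE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Three frames of the sequence and three of its inverse complement."""
    return [translate(strand[offset:])
            for strand in (seq, revcomp(seq))
            for offset in range(3)]

gene = "ATGACGTTAGCATGAAGATAA"   # hypothetical toy sequence, not a real gene
peptides = {(rule, frame): pep
            for rule in RULES
            for frame, pep in enumerate(six_frames(exchange(gene, rule)))}
print(len(peptides))        # hypothetical peptides per gene
print(13 * len(peptides))   # across the 13 protein coding genes
```

Each of the resulting strings would then be submitted to Blastp; stops are kept as asterisks, so no assumption is made about which amino acid a suppressor tRNA would insert.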
These 702 hypothetical protein sequences were analyzed by GenBank’s Blastp (Altschul et al., 1997, 2005) using standard default parameters of Blastp as has been used and described in previous publications (Seligmann, 2011a, 2012c,d,e, in press-a, in press-b). Blastp indicates whether the putative peptide is similar to proteins existing in GenBank. It produces a homology hypothesis that indicates the candidate overlapping genes coded after nucleotide exchange transcription.
2.4. Duration spent single stranded by DNA during replication by mitochondrial protein coding genes
Some analyses below describe patterns in nucleotide contents due to replicational deamination gradients along a gradient of duration of DNA single strandedness during DNA replication. This is because single strandedness increases A→G and C→T deamination rates more than it increases the opposite mutations G→A and T→C (Fredrico et al., 1990).
These spontaneous mutations are counterselected at coding sites, but have detectable effects on nucleotide contents at third codon positions in protein coding genes (Krishnan et al., 2004a,b; Seligmann et al., 2006). Third codon positions usually have also an additional coding role when a sequence is involved in overlap coding. Systematic nucleotide exchanges may reveal such overlapping protein coding genes. The replicational gradient should not be detectable in such overlap coding regions if these are functional (Seligmann, 2012a,d). Hence analyses of effects of replicational gradients on nucleotide contents at third codon positions should highlight the coding status of overlapping genes.
For that purpose, durations spent single stranded during replication are calculated for each human mitochondrial protein coding gene, using the gene’s midpoint location along the genome. Duration spent single stranded by a site is a function of the distance of that site from the heavy strand replication origin (OH) and the light strand replication origin (OL). This duration spent single stranded is 2 × b/N for the genes ND1 and ND2 (genes located between the OH and OL), where b is the midlocation of the gene in number of nucleotides counted from the OH, in the 5′→3′ direction of the genome’s heavy strand, and N is the total genome length. Note that standard mitochondrial genome annotations in GenBank indicate the numberings according to the light strand, which may cause some confusion in calculating durations spent single stranded during replication (Tanaka and Ozawa, 1994; Raina et al., 2005; Seligmann et al., 2006). For the other genes, replicational single strandedness is 2 × (OL − b)/N, where OL indicates the midlocation of the light strand replicational origin, according to heavy strand numbering.
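The two expressions above can be transcribed literally into a small function. This Python sketch assumes that b and OL have already been converted to heavy strand numbering counted from the OH, as the text requires; the function name and the numerical values in the example are hypothetical and are not actual gene coordinates.

```python
def duration_single_stranded(b, n, ol=None, between_oh_and_ol=False):
    """Relative duration spent single stranded, following the two formulas above.

    b  : gene midpoint, heavy strand numbering counted from the OH
    n  : total genome length
    ol : midpoint of the light strand origin, same numbering
         (only needed for genes not located between the OH and OL)
    """
    if between_oh_and_ol:          # case stated for ND1 and ND2
        return 2 * b / n
    if ol is None:
        raise ValueError("ol is required for genes outside the OH-OL interval")
    return 2 * (ol - b) / n

# Toy numbers chosen only to exercise both branches.
print(duration_single_stranded(b=100, n=1000, between_oh_and_ol=True))
print(duration_single_stranded(b=100, n=1000, ol=600))
```

Keeping the coordinate conversion outside the function makes the dependence on the numbering convention explicit, which is the step the text flags as a common source of confusion.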
2.5. Circular code analyses
The circular code theory indicates that a set of 20 autocomplementary codons (the 20 codons include the inverse complement of each of these codons) is overrepresented in the coding frame of regular protein coding genes (Arqués and Michel, 1996, 1997). Coded alphabetical communication in human languages consists typically of letters forming words, and of punctuation symbols (comma, question mark, etc.). Besides stop codons, in the genetic code, codons coding for amino acids apparently have also ‘punctuation’ roles: the circular code codons apparently regulate the reading frame, as suggested also by circular code properties of ribosomal RNA that interacts with the mRNA (Michel, 2012).
It seems that when more than one frame in a sequence is coding, the statistical property of overrepresentation of circular code codons is lost, perhaps because ‘punctuation’ signals of several frames are mixed, or nonexistent to facilitate passage between frames. On the other hand, homopolymer codons (AAA, CCC, GGG, TTT), which tend to cause frameshifts (one of the main mechanisms for switching between coding frames), are relatively overrepresented in overlap coding regions (Ahmed et al., 2007; Ahmed and Michel, 2011).
Sequences solely composed of these 20 codons have a non-redundant feature: if nucleotide triplets are not read according to the frame of the codons that compose the sequence, one will soon find a codon that is not part of the initial set of 20 codons, indicating that the reading frame is ‘incorrect’. This lack of redundancy between frames is one of the characteristics of circular codes, and could be related to the reason why circular code codons are underrepresented in overlapping genes. Hence the proportion of homopolymers among the sum of homopolymers and circular code codons should be greater in predicted overlap coding sequences than in adjacent regions of a gene. Statistical confirmation of this prediction by sequence data should be considered as constituting independent evidence for the function of that sequence as overlap coding, in this case after systematic symmetric nucleotide exchange.
Note that the natural circular code is characterized by a set of 20 autocomplementary codons associated with each frame, where the 20 circular codons of frames +1 and +2 are produced by specific permutations of the nucleotides in the circular codons in frame ‘0’ (frame +1: the first nucleotide in frame ‘0’ is permuted to the third position; frame +2: the third nucleotide in the frame ‘0’ circular codon is permuted to first codon position, producing the circular code codon of frame +2). None of these three sets of 20 circular codons includes any of the four homopolymers. Hence tests performed here are on averages of homopolymer/circular code proportions calculated over the three frames for each set of 20 circular code codons (frame 0, +1 and +2 circular codes).
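The homopolymer proportion can be sketched in code as follows. Several assumptions are made here: the set X0 below is the frame-0 circular code as commonly cited from Arqués and Michel (1996) and is not listed in this text, so it should be checked against that reference; the averaging over the three codes is one reading of the last sentence above; and all function names and the example sequence are hypothetical.

```python
X0 = {"AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
      "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC"}
X1 = {c[1:] + c[0] for c in X0}    # first nucleotide moved to third position
X2 = {c[2] + c[:2] for c in X0}    # third nucleotide moved to first position
HOMOPOLYMERS = {"AAA", "CCC", "GGG", "TTT"}

def revcomp(codon):
    """Inverse complement of a DNA codon."""
    return codon.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def codons(seq, frame=0):
    """Complete codons of seq, read from the given frame offset."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

def homopolymer_proportion(seq, code, frame=0):
    """Homopolymers / (homopolymers + circular code codons), or None if both are 0."""
    cs = codons(seq, frame)
    h = sum(c in HOMOPOLYMERS for c in cs)
    x = sum(c in code for c in cs)
    return h / (h + x) if h + x else None

def mean_proportion(seq, frame=0):
    """Average of the proportion over the frame 0, +1 and +2 circular codes."""
    vals = [homopolymer_proportion(seq, code, frame) for code in (X0, X1, X2)]
    vals = [v for v in vals if v is not None]
    return sum(vals) / len(vals) if vals else None

region = "AAAGAAGTTCCC"            # hypothetical example, not sequence data
print(homopolymer_proportion(region, X0))
print(mean_proportion(region))
```

The prediction described above would then be tested by comparing this proportion between a candidate overlap coding region and the adjacent regions of the same gene.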
2.6. Kinetics of nucleotide misinsertions and systematic nucleotide exchanges
It is plausible that rates (or frequencies) of the various types of systematic nucleotide exchanges during RNA transcription correspond to known rates of occasional nucleotide misinsertions by polymerases. These are not known for the RNA polymerase, but one might use as proxy those known for the human mitochondrial DNA polymerase gamma (Lee and Johnson, 2006). These kinetic parameters are indicated as kd and kpol in Table 2 from Lee and Johnson (2006). For each type of systematic symmetric nucleotide exchange, I averaged the corresponding kds, and separately, kpols from Lee and Johnson (2006). For example, for the systematic symmetric nucleotide exchange A↔C, the kds averaged were 160 (A→C), 540 (C→A), 150 (G→G) and 57 (T→T), resulting in the mean kd for that type of nucleotide exchange of 226.75 µM. One expects that some proportionality exists between these averages and independent estimates of frequencies or rates of nucleotide exchange polymerization. Positive results would be strong confirmation of the working hypothesis, as they would explain observations on transcripts existing in GenBank by independent parameters of DNA misinsertion polymerization kinetics.
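The worked example for A↔C amounts to a simple average, shown here as a short Python sketch. Only the four kd values quoted above are used; the variable and function names are hypothetical, and values for the other exchange rules would have to be taken from Table 2 of Lee and Johnson (2006), which is not reproduced here.

```python
# kd values (in µM) quoted in the text for the exchange rule A<->C:
# two misinsertions (A->C, C->A) and two unchanged nucleotides (G->G, T->T).
kd_ac = {"A->C": 160, "C->A": 540, "G->G": 150, "T->T": 57}

def mean_parameter(values):
    """Arithmetic mean of the kinetic parameters associated with one rule."""
    values = list(values)
    return sum(values) / len(values)

mean_kd = mean_parameter(kd_ac.values())
print(mean_kd)   # 226.75
```

The same function would be applied separately to the kpol values of each rule.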
3. Results and discussion
3.1. RNAs in GenBank
A priori, there is no evidence that systematic nucleotide exchanges occur, but the large online databases of RNA sequences (expressed sequence tags, EST, in GenBank) allow searching for RNAs that match the assumed exchange-based recoding of regular genes. I explore, for all 9 symmetric nucleotide exchanges presented in Table 1, whether such RNAs exist in the database for the complete human mitochondrial genome. Table 2 presents all RNAs detected by Blastn (Zhang et al., 2000) for GenBank’s EST database that align with some parts of the human mitochondrial genome, after that genome has been recoded according to each of the nine systematic symmetric nucleotide exchanges.
There are 51 such RNA sequences originating from 12 independent studies of RNA expression. No RNA sequence was detected for
Table 2
Human RNA transcripts detected by Blastn in GenBank’s EST database and aligning with human mitochondrial genome sequences after symmetrically exchanging nucleotides in the sequence. Columns are: 1. exchange nucleotide rule; 2. gene, and DNA strand matching EST transcript; 3. alignment first and last nucleotides; 4. alignment length; 5. similarity (%) between aligning sequences; 6. description of EST; 7. EST entry in GenBank; 9. EST reference.

Sub | Gene | Loc | N | Si | Origin | ID | Ref
AG | ND1− | 687–812 | 131 | 77 | Similar to NADH1, renal cell tumor | AI367501 62 | 1
AG | ND5− | 1423–1577 | 165 | 95 | Homo sapiens hypothalamus | AV721614 30 | 2
AG | 12s− | 403–634 | 138 | 88 | Normalized rat brain | AI230934 53 | 3
AG | Ser4− | 2–69 | 68 | 100 | Female pectoral muscle after mastectomy | AJ574322 50, AJ574357 51 | 4
AU | ND1+ | 357–851 | 495 | 99 | Colon | AW176957 53 | 5
AU | ND1+ | 1–231 | 231 | 97 | Colon ins | BF798658 52, BF798678 53 | 6
AU | ND1+ | 27–231 | 210 | 93 | Colon ins | BF798657 50 | 6
AU | ND2+ | 202–593 | 395 | 99 | Head neck, FAPESP/LICR Human Cancer Genome Project | AI940581 57 | 5
AU | CO1− | 122–634 | 513 | 99 | Colon, The FAPESP/LICR Human Cancer Genome Project | AW176982 50 | 5
AU | ND4+ | 852–1240 | 391 | 99 | Colon | CK327105 54 | 6
AU | 12s+ | 2–268 | 271 | 97 | Colon ins | BF798660 51, BF798653 51, BF798647 (–) 52 | 6
AU | 16s+ | 1405–1559 | 156 | 99 | Colon ins | BF798658 52, BF798678 53 | 6
AU | Leu2+ | 1–75 | 75 | 100 | Colon ins | BF798658 52, BF798678 53 | 6
CG | AT6+ | 486–669 | 188 | 83 | Thymus | BX457166 47 | 7
CG | 16s+ | 565–715 | 151 | 97 | Adult heart, female pectoral muscle after mastectomy | N41204 38, AJ574341 36, AJ574283 49 | 8, 4
CU | ND1+ | 770–952 | 183 | 99 | Female pectoral muscle after mastectomy | AJ574326 63 | 4
CU | CO1+ | 1509–1542 | 34 | 100 | Female pectoral muscle after mastectomy | AJ574346 48, AJ574322 45 | 4
CU | ND4+ | 851–1042 | 196 | 95 | Prostate normal | BF370011 56 | 6
CU | ND4+ | 1265–1378 | 114 | 96 | Female pectoral muscle after mastectomy | AJ574311 60 | 4
CU | CytB+ | 852–1133 | 285 | 99 | Hypothalamus | AV722273 59, AJ574321 56, AJ574347 57, AJ574344 58, AJ574371 55, AJ574334 53, AJ574333 54 | 2, 4
CU | 16s+ | 24–401 | 378 | 97 | Hypothalamus | AV722267 45 | 2
CU | 16s+ | 361–556 | 196 | 95 | Female pectoral muscle after mastectomy | AJ574370 46 | 4
CU | 16s+ | 411–776 | 367 | 96 | Hypothalamus, female pectoral muscle after mastectomy | AV721363 48, AJ574341 46, AJ574327 45 | 2, 4
CU | 16s+ | 122–382 | 211 | 99 | Female pectoral muscle after mastectomy | AJ574335 40, AJ574332 42 | 4
CU | 16s+ | 724–937 | 548 | 99 | Female pectoral muscle after mastectomy | AJ574378 49, AJ574291 49 | 4
GU | ND1+ | 12–377 | 369 | 92 | Cell line | AI525967 49 | 9
GU | ND2− | 564–902 | 353 | 84 | Nervous normal | BI032899 59 | 6
GU | AT6+ | 32–150 | 155 | 88 | Gastric epithelial progenitor Mus musculus, ATP6 | CF425368 36 | 10
GU | AT6− | 386–536 | 120 | 80 | Head neck | AW381066 61 | 5
GU | CytB+ | 188–331 | 147 | 94 | Adult heart | AA413440 46 | 11
GU | 16s+ | 328–780 | 456 | 95 | Tissue culture | AI541277 48 | 9
GU | 16s+ | 46–423 | 302 | 96 | Cell line | AI525977 60 | 9
GU | 16s+ | 57–1130 | 281 | 80 | Human aorta polyA+ mRNA | C15855 41 | 12

1. Strausberg, 1997. National Cancer Institute, Cancer Genome Anatomy Project (CGAP), Tumor Gene Index, unpub.
2. Gu, Y., Peng, Y., Song, H., Huang, Q., Yang, Y., Gao, G., Xiao, H., Xu, X., Li, N., Qian, B., Liu, F., Qu, J., Gao, X., Cheng, Z., Xu, Z., Zeng, L., Xu, S., Gu, W., Tu, Y., Jia, J., Fu, G., Ren, S., Zhong, M., Lu, G., Hu, R., Chen, J., Chen, Z., Han, Z., 2000. Homo sapiens cDNA HTB clones, unpub.
3. Lee, N.H., Glodek, A., Chandra, I., Mason, T.M., Quackenbush, J., Kerlavage, A.R., Adams, M.D., 1998. Rat Genome Project: Generation of a Rat EST (REST) Catalog & Rat Gene Index, unpub.
4. Laveder, P., De Pitta, C., Vitulo, N., Valle, G., Lanfranchi, G., 2003. Oligo-directed RNase H cleavage of abundant mRNAs in skeletal muscle, unpub.
5. Simpson, A.J.G., 1999. The FAPESP/LICR Human Cancer Genome Project, unpub.
6. Dias Neto et al. (2000).
7. Li, W.B., Gruber, C., Jessee, J., Polayes, D., 2001. Full-length cDNA libraries and normalization, unpub.
8. Lui et al. (1995).
9. Huang et al. (1999).
10. Tidwell, R., Clifton, S., Marra, M., Hillier, L., Pape, D., Martin, J., Wylie, T., Theising, B., Bowers, Y., Gibbons, M., Ritter, E., Bennet, J., Ronko, I., Tsagareishvili, R., Belaygorod, L., Grow, A., Maguire, L., Waterston, R., Wilson, R., 2002. Unpublished.
11. Liew et al. (1994).
12. Fujiwara, T., Hirano, H., Katagiri, T., Kawai, A., Kuga, Y., Nagata, M., Okuno, S., Ozaki, K., Shimizu, F., Shimada, Y., Shinomiya, H., Takaichi, A., Takeda, S., Watanabe, T., Takahashi, E., Hirai, Y., Maekawa, H., Shin, S., Nakamura, Y., 1995, unpub.
three exchange types: the exchange A↔C, and two of the three exchange rules involving all four nucleotide types, A↔C+G↔U and A↔G+C↔U. Among nucleotide exchanges involving only two nucleotides, the most common were RNAs whose recoding exchanges involve uracil: C↔U (21 sequences), A↔U (14 sequences) and G↔U (8 sequences). The systematic exchanges A↔G and C↔G were found in 5 RNA sequences each. The exchange involving all four nucleotides, A↔U+C↔G, was quite common (10 sequences, data not presented in Table 2) and is analyzed in detail separately (Seligmann, 2012d). It is notable first of all that the three most common exchanges are those between uracil and the three other nucleotides. Hence uracil, which replaces thymidine during regular transcription, also seems most frequently involved in ‘unusual’ exchanging transcription.
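As an illustration only, a symmetric exchange rule can be written as a simple character swap over an RNA string. The function name `exchange`, the `'C<->U'` rule notation and the example sequence are assumptions of this sketch and are not taken from the analyses described here.

```python
# Illustrative sketch: apply a symmetric nucleotide exchange rule to an RNA string.
# The rule notation ("C<->U", "A<->U+C<->G") is an assumption made for this example.

def exchange(seq: str, rule: str) -> str:
    """Swap nucleotides pairwise according to a symmetric exchange rule."""
    table = {}
    for pair in rule.split("+"):
        x, y = pair.split("<->")
        table[ord(x)] = y   # every x becomes y
        table[ord(y)] = x   # and every y becomes x
    return seq.translate(table)

regular = "AUGCCUUA"                      # made-up sequence
print(exchange(regular, "C<->U"))         # ACGUUCCA
print(exchange(regular, "A<->U+C<->G"))   # UACGGAAU
```

Because each rule is symmetric, applying it twice returns the original sequence, which is one way to check that a candidate transcript and its template are related by a single exchange rule.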
Blastn analyses detect in total 61 ‘exchanged’ sequences, including the 10 for the A↔U+C↔G exchange rule that are not presented in Table 2 (these 10 transcripts are presented in Table 1 of Seligmann, 2012d). This is 0.56% of the 10899 ESTs annotated as being of human mitochondrial origin in GenBank’s database by June 2012. It would also be very interesting in this context to explore the high-accuracy transcript data available for the human mitochondrial transcriptome (Mercer et al., 2011a,b). These data, available at http://mitochondria.matticklab.com, are not searchable at this point along the guidelines of nucleotide exchanging RNA transcription, but this database would probably yield additional insights into the frequencies of the various types of exchange transcriptions.
Please cite this article in press as: Seligmann, H., Polymerization of non-complementary RNA: Systematic symmetric nucleotide exchanges mainly involving uracil produce mitochondrial RNA transcripts coding for cryptic overlapping genes. BioSystems (2013), http://dx.doi.org/10.1016/j.biosystems.2013.01.011
ARTICLE IN PRESS G Model
BIO 3360 1–19
H. Seligmann / BioSystems xxx (2013) xxx–xxx 5
Fig. 1. Loopiness of transcripts in Table 2 as a function of their relative length. Secondary structure predictions estimate the average number of structures in which the average site does not form a stem by self-hybridization in RNA (loopiness), assuming symmetric exchanging transcription. The y axis is the subtraction of that mean for gene regions that are not within such nucleotide exchanging transcripts, from the mean loopiness of the regions transcribed by exchanging transcription and listed in Table 2. The x axis is the relative proportion the exchanging transcript represents from the total length of that gene. Gene identities, and the types of symmetric nucleotide exchange, are indicated next to each datapoint.
3.1.1. Exchanging artifacts
The various sequences in Table 2 originate from 12 studies of RNA, with RNAs matching different types of exchanges originating in some cases from the same study, and RNAs matching some types of exchanges originating from several studies. These data are important material evidence suggesting a family of previously undescribed RNA recoding types, a potentially major discovery for genomics and molecular biology. For that reason, these sequence data are examined for a number of possible artifacts. First, if all or most sequences originated mainly from one study, one could suggest that the exchanges were due to specific conditions associated with that study. Possibly, erroneous sequence manipulation, perhaps incorrect or only partial inverse complementing of sequences by semi-automated methods, could have created the sequences in Table 2. For example, the only symmetric exchange involving all four nucleotides for which corresponding RNA has been detected (A↔U+C↔G) can result from complementing a sequence without inverting the nucleotide order, a potential sequence manipulation error that could create the ten BLAST hits matching this exchange rule (which are not reported in Table 2). For analyses excluding the possibility of artifacts for these 10 sequences, see Fig. 1 in Seligmann (2012d), which shows that their length increases with their relative secondary structure formation capacities. Erroneous partial complementing (of A↔U or C↔G) could create the RNAs detected and matching these two additional types of nucleotide exchanges. However, these annotation artifacts could not explain the occurrences of RNA corresponding to A↔G, C↔U and G↔U. It is most probable that the data in Table 2 are not the result of such in silico sequence manipulations, especially since as many as 12 studies produced such sequences.
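The manipulation errors considered above can be made concrete with a short sketch. It assumes a made-up sequence and hypothetical helper names (`swap`, `reverse_complement`); it only shows which exchanges a full or partial complementing error can mimic, and which it cannot.

```python
from itertools import combinations

def swap(seq, pairs):
    """Swap each listed nucleotide pair, e.g. pairs=("AU", "CG")."""
    table = {}
    for x, y in pairs:
        table[ord(x)] = y
        table[ord(y)] = x
    return seq.translate(table)

def reverse_complement(seq):
    """Correct inverse complement: complement every base, then reverse the order."""
    return swap(seq, ("AU", "CG"))[::-1]

seq = "AUGCCUUA"  # made-up sequence

# Complementing without reversing the order is the A<->U+C<->G exchange.
forgot_to_reverse = swap(seq, ("AU", "CG"))
# Partial complementing mimics the A<->U or the C<->G exchange.
only_au = swap(seq, ("AU",))
only_cg = swap(seq, ("CG",))

# Two-nucleotide swaps that no complementing error of this kind can produce.
watson_crick = {frozenset("AU"), frozenset("CG")}
not_explained = sorted(
    "".join(sorted(p))
    for p in map(frozenset, combinations("ACGU", 2))
    if p not in watson_crick
)
print(not_explained)  # ['AC', 'AG', 'CU', 'GU']
```

Of the swaps in that last list, A↔G, C↔U and G↔U are the ones for which matching RNAs are reported in the text.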
Another possibility is that of a statistical artifact. The exchanged sequences usually exchange between two nucleotides, so they remain identical to the original, regular sequence for the two other nucleotides. Hence on average, half of the nucleotides are exchanged, giving an expected mean similarity between the exchanged RNA and the regular transcript of 50%. However, all (but one) similarities in Table 2 are >80%, and only 7 are below 90%. Nucleotide ratios vary locally, so that high similarities that do not imply exchange transcription are possible because locally, in these sequences, the exchanged nucleotides might have very low frequencies. However, the sequences in Table 2 and their high similarities are not compatible with extreme local nucleotide biases creating the illusion of exchange transcription: the exchanged nucleotides never represent less than 30% of the EST sequence, which would yield at best a similarity of 70% with the regular transcript, assuming that no nucleotide exchange actually occurred and that the RNA reported in Table 2 does not result from exchanging transcription, but from the low local proportion of the exchanged nucleotides in its composition. In fact, all the RNA sequences include all four nucleotides, in proportions that seem incompatible with the high similarities observed if no systematic transcriptional exchange occurred (see Table 2; percentages are indicated next to GenBank entries). Hence nucleotide biases did not create false positives for exchanging transcription for the wide majority of transcripts presented in Table 2. Therefore, most data in Table 2 do not result from statistical artifacts involving local nucleotide biases. The specific cases of potential exceptions, three transcripts in Table 2 with low similarities, are examined in some detail in a section below.
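The bound used in this argument follows from a simple identity: a sequence and its exchanged version differ exactly at the positions occupied by the two swapped nucleotides, so their similarity is one minus the fraction of those nucleotides. The sketch below illustrates this on a made-up sequence; the helper names are hypothetical.

```python
def swap(seq, x, y):
    """Exchange nucleotides x and y throughout the sequence."""
    return seq.translate({ord(x): y, ord(y): x})

def identity(a, b):
    """Fraction of aligned positions carrying the same nucleotide."""
    return sum(p == q for p, q in zip(a, b)) / len(a)

def exchanged_fraction(seq, x, y):
    """Fraction of positions occupied by either of the two swapped nucleotides."""
    return sum(base in (x, y) for base in seq) / len(seq)

seq = "AUGCCUUAGC"                                 # made-up sequence
f = exchanged_fraction(seq, "C", "U")              # C or U at 6 of 10 positions
similarity = identity(seq, swap(seq, "C", "U"))    # the 4 other positions match

# Similarity without a real exchange cannot exceed 1 - f;
# with f >= 0.30 this gives at most 0.70.
assert abs(similarity - (1 - f)) < 1e-12
```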
3.1.2. Alternative biological explanations
The next potential problems with the nucleotide exchange interpretation of the data in Table 2 are of a biological nature. It is possible that regular transcription of other, nuclear DNA sequences produces the transcripts in Table 2. This possibility cannot be totally ruled out a priori. The RNAs in Table 2 have high and even very high similarities with the mitochondrial sequences after assuming exchanging transcription. This means that if these RNAs are produced by regular transcription of nuclear (or cytosolic) sequences, and not by exchanging transcription of mitochondrial sequences, these nuclear sequences resulted from exchanging reverse transcription of mitochondrial RNA into nuclear DNA, or some other exchanging process creating a nuclear mitochondrial (pseudo)gene that involved systematic nucleotide exchanges.
Hence even if the RNAs in Table 2 were not the result of exchanging transcription (=exchanging RNA polymerization), they would reflect exchanging DNA polymerization. Searching GenBank’s human genome data with BLAST does not yield any positive hits for the mitochondrial sequences transformed according to any of the nine symmetric exchanging transcription rules. This negative result does not totally rule out the possibility that regular transcription of nuclear or cytosolic DNA is at the origin of the RNAs in Table 2, but there is no evidence to support this possibility at this point. Hence this biological interpretation seems unlikely. However, even if this nuclear origin were true, it would indicate that DNA polymerization following exchanging rules occurs. Such exchanging DNA polymerization would still be important indirect evidence in favor of the working hypothesis that exchanging RNA polymerization occurs, and would be compatible with the existence of protein coding genes within these exchanged sequences. It would be direct evidence for the creation of new genes through nucleotide exchanges.
The last considered biological alternative to exchanging transcription relates to the fact that all the transcripts in Table 2 originate from studies that date from before the year 2004. This suggests that the RNAs might result from rare dysfunctions by the
reverse transcriptases used in the creation of cDNA libraries from extracted RNA transcripts, which form the EST databases. It is indeed possible that such flaws were discovered at some point and that EST libraries produced after 2003 are exempt from these flaws. If this is the case, the data in Table 2 do not directly reflect exchanging transcription activity, but exchanging reverse transcription. At this stage, the correct interpretation of the data presented in Table 2 could be that natural exchanging transcription occasionally occurs, or that exchanging reverse transcription by the enzymes used to create the EST libraries occasionally occurs. However, even in the latter case, such exchanging reverse transcription would be at least indirect evidence that exchanging transcription and associated protein coding genes might exist, as RNA and DNA polymerases have great similarities. It is notable in this context that the human mitochondrial polymerase gamma, which usually replicates the mitochondrial genome, also has reverse transcription activity (Kasiviswanathan and Copeland, 2011).
Even if the RNAs in Table 2 are not of natural origin, but result from some kind of dysfunction of the reverse transcriptase used to produce the EST libraries, the frequencies of the various types of nucleotide exchanges suggested by the data in Table 2 would still be informative from a biological point of view: these dysfunction frequencies would probably indicate the occurrence of natural dysfunctions of these types. In the next section, analyses of secondary structures formed by exchanged transcripts suggest that the reverse transcriptase ‘artefact’ is the less likely explanation.
3.1.3. Secondary structure formation by exchanging transcripts
According to the scenarios described in Section 3.1.2, the data in Table 2 would reflect a biological reality, which has 2–3 alternative interpretations, all based on the principle of ‘exchanging’ polymerization: of RNA on the basis of DNA (transcription), of DNA on the basis of DNA (replication), or of DNA on the basis of RNA (reverse transcription).
A further analysis confirms that the sequences in Table 2 reflect a biological phenomenon, most probably due to RNA polymerization, though transcript editing cannot be ruled out. Using Mfold (Zuker, 2003), secondary structure formation by gene sequences transcribed assuming the specific exchanging rule was predicted for the complete (exchanged) sequence of genes for which exchange transcripts were found (Table 2). I calculated the mean percentage of nucleotide participation in loops separately for the regions belonging to the RNA sequences presented in Table 2 and for the rest of that gene. Loopiness of the ‘exchanged RNA’ was greater in regions that underwent exchanging transcription according to Table 2 than in the rest of the gene transcribed, assuming exchanging transcription for that region (though no such exchange transcription was detected for that region), in 23 of the 33 (62%) sequences for which ‘exchanging’ transcripts were detected. Table 2 lists more sequences because in some cases several transcripts were found matching the same genome region. This slight majority is statistically significant according to a one tailed sign test (P = 0.047), suggesting that the transcripts produced by exchanging transcription usually tend to form less secondary structure than the rest of the gene (assuming it was also exchange transcribed). This means that the transcripts in Table 2 have a common feature, and are not a random sample of potential transcripts.
The tendency for high loopiness differed among the various types of exchange transcriptions: it is weakest for A↔U exchanges (33%), intermediate for C↔G and G↔U exchanges (50%), 60% for A↔U+C↔G exchanges (not in Table 2), and strongest for A↔G and C↔U exchanges (100 and 90%, respectively). These differences are in no way statistically significant, as the number of cases is too low even to consider statistical tests (one transcript for A↔G and two for C↔G).
However, for C↔U, considering that there are 10 cases, a one tailed sign test indicates a statistically significant tendency for greater loopiness in regions that underwent C↔U exchanging transcription than in other regions of the same gene, assuming they too had undergone nucleotide exchange transcription along the C↔U rule. For the 10 transcripts following the C↔U rule, the probability of getting 9 among 10 cases where loopiness is greater than for the rest of the gene yields, according to a binomial distribution (the distribution used in sign tests), the statistical significance P = 0.0054. In other words, if one were to assume that loopiness in exchange transcribed regions is as likely to be above as below the loopiness in surrounding regions, the probability of getting 9 among 10 exchange transcripts with greater loopiness is about half a percent. Hence it is unlikely that exchange transcribed regions have on average the same loopiness as other regions. This tendency indicates that self-hybridization disfavours the production of ‘exchanged’ transcripts. This strengthens the possibility that exchanges result from editing of transcripts, where secondary structure might prevent or at least impede editing after polymerization. However, this does not preclude that exchanges occurred during transcription itself. If RNA polymerization that systematically exchanges nucleotides is relatively slow, it might be particularly impeded by secondary structure formation, and hence loopiness might promote it.
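For readers who wish to experiment with sign tests of this kind, a minimal sketch of the upper tail of a binomial distribution with equal chances is given below. The function name is hypothetical, and the sketch is not a reproduction of the P values reported above: the exact figure obtained depends on the conventions chosen (which tail is summed, how ties are handled).

```python
from math import comb

def sign_test_upper_tail(successes: int, n: int) -> float:
    """Probability of observing `successes` or more out of n when each case
    is equally likely to fall on either side (binomial with p = 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n

# Example: 9 of 10 cases in the same direction (the C<->U situation in the text).
print(sign_test_upper_tail(9, 10))
```

The calculation uses only exact integer binomial coefficients, so it stays reliable for the small sample sizes discussed in this section.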
3.1.4. The length of exchanging transcripts
An additional observation, which relates to the capacity for secondary structure formation by the RNAs in Table 2 in relation to their length, might give a clue on the nature of the process involved: the loopiness of exchanging transcripts, as compared to the rest of the gene sequence, decreases with the relative length of the exchanging transcript (Fig. 1). This suggests that exchange transcription (or editing) is favored by free access to the elongating RNA polymer, but that in order for that polymer to reach a sizeable proportion of the total length of the gene, it should form secondary structure. The ‘paradox’ between the requirement that an exchange-transcribed region forms little secondary structure, and the requirement, for its elongation, that it self-hybridizes, could explain why transcripts produced by systematic nucleotide exchanges are rare.
I propose in this context the following interpretation. By definition, exchanging transcription does not produce RNA that is the inverse complement of its template DNA strand, and hence the elongating RNA does not form a duplex with DNA during its elongation. As a result, it is single stranded and open to digestion by endoribonucleases, which would shorten it. Hence in order to reach relatively great lengths, polymerization of non-complementary RNA (or DNA) requires protection by self-hybridization (secondary structure formation), as it cannot be protected by hybridization with existing DNA (or RNA) as for regular transcripts. Regular transcripts are protected by both hybridization with the ‘maternal’ strand and self-hybridization, but for transcripts produced by exchanging transcription, complementarity is much lower, and hence elongation is much more dependent on protection due to self-hybridization. In the extreme case of the exchange rule that involves two pairs of exchanged nucleotides (A↔U+C↔G), there is no complementarity at all, and protection can only result from self-hybridization. Therefore for the 10 transcripts following that rule, the correlation between relative loopiness and transcript length is much stronger than for the other exchange transcription types: r = −0.65. The association for the rest of the transcripts, in Fig. 1, is much weaker and is only statistically significant if transcripts are split into two groups, those below and those above the relative length of 20% of the length of their gene. A one tailed Fisher exact test indicates that there are more transcripts with more loopiness in the exchange transcribed part than in
the rest of the gene (positive loopiness values on the y axis in Fig. 1) for transcripts with relative length <20% than for those with relative length >20% (P = 0.052). Excluding the extreme length outlier indicated by a triangle in Fig. 1 yields P = 0.042. Including the ten sequences of the A↔U+C↔G exchange type (from Table 1 in Seligmann, 2012d), the test yields P = 0.023.
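The two-group comparison above can be sketched as a one tailed Fisher exact test on a 2 × 2 table (short versus long transcripts, positive versus negative relative loopiness). The cell counts behind the reported P values are not given in the text, so the counts used below are purely hypothetical and serve only to show the calculation; the function name is likewise an assumption.

```python
from math import comb

def fisher_one_tailed(a: int, b: int, c: int, d: int) -> float:
    """One tailed Fisher exact test for the table [[a, b], [c, d]]:
    probability of a top-left count of `a` or more, given fixed margins."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    return sum(
        comb(row1, k) * comb(row2, col1 - k) for k in range(a, upper + 1)
    ) / total

# Hypothetical counts (NOT from Table 2 or Fig. 1):
# rows = relative length <20% / >20%, columns = positive / negative loopiness.
print(fisher_one_tailed(3, 1, 1, 3))
```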
These negative correlations between exchange transcript lengths and loopiness indicate that endoribonucleases (or other enzymes with similar activities) are active during exchanging transcript production. This situation is not totally incompatible with the possibility that the sequences listed in Table 2 were produced by occasional dysfunctional reverse transcription during the creation of the cDNA libraries in GenBank, but seems more in line with transcription occurring under natural physiological circumstances. Hence the working hypothesis that various types of exchanging transcription occasionally occur under natural physiological conditions seems the most probable explanation for the data in Table 2, and is not incompatible with the alternative explanations that could not be totally ruled out (exchanging replication or exchanging reverse transcription).
3.1.5. Putative protein coding genes in exchanging transcripts
Table 2 presents some data favoring the working hypothesis of exchange transcription. The working hypothesis is formulated on the basis of an evolutionary principle of minimizing costs due to genome size, assuming that overlap coding (which in this case results from exchange transcription) increases the number of genes and the genome’s coding density without increasing its size. Hence evidence confirming that transcripts of protein coding genes, after systematically exchanging nucleotides, potentially include regions that code for proteins would strengthen the hypothesis on two grounds: first, because it would confirm the basic evolutionary principle underlying the advantage associated with exchange transcription by indicating its role in revealing coding potential; and second, because consistent patterns in (exchange transcription) overlap coding genes would be in themselves evidence that exchange transcription occurs, independently of physical evidence for RNA transcripts presumably produced by exchange transcription (Table 2). In addition, if analyses of coding properties of RNA after nucleotide exchange converge with those in Table 2, for example if coding seems more probable for exchange types that are relatively more represented in Table 2, and less probable in those for which no transcripts were detected, this coherence between different types of independent data and analyses would, in the context of a meta-analysis, be strong evidence for (overlap) protein coding based on exchange transcription.
There are 702 hypothetical peptides for the 13 human mitochondrial protein coding genes. These were analyzed by GenBank’s Blastp (Altschul et al., 1997, 2005) and hits with proteins existing in GenBank were recorded (Table 3). These analyses produced numerous hits, from 9 for A↔C exchanges to 36 for G↔U exchanges, in total between 483 codons (for A↔C exchanges) and 2801 codons (for A↔G exchanges) putatively involved in overlap coding associated with exchange transcription. It is notable that several hits, mainly for exchanges involving the transitions C↔U and A↔G, were for the frame corresponding to the gene’s regular main frame, and with proteins that are homologous to the protein coded by the regular main frame of that gene. These cases may be of interest, but are excluded from the analyses of overlapping genes presented here, and also from the statistics on putative overlapping genes at the beginning of this paragraph.
It is notable that the average length of putative alignments detected by Blastp for a type of nucleotide exchange is proportional to the number of transcripts detected for that type of exchange as reported in Table 2 (Pearson parametric correlation coefficient r = 0.747, P = 0.0104; Spearman nonparametric correlation coefficient rs = 0.75, P = 0.0166, one tailed tests, Fig. 2). This result is a type of meta-analysis of the data of all exchange transcription types, which indicates that overall, overlap coding by nucleotide exchange might occur, in proportion to the observed frequency of exchange transcription.
Fig. 2. Mean length of putative overlapping protein coding genes predicted by Blastp analyses (from Table 3) as a function of the number of exchanging transcripts according to that exchange rule (from Table 2). The type of nucleotide exchange assumed by analyses is indicated next to each datapoint, followed by the number of alignments with GenBank proteins interacting with DNA or RNA, membrane proteins, and proteins with physiological functions typical of mitochondria.
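The two correlation coefficients reported above can be computed as sketched below. The data points are hypothetical placeholders (the per-rule values are those plotted in Fig. 2 and are not reproduced here), and the helper names are assumptions of this sketch; the Spearman coefficient is obtained as the Pearson coefficient of the ranks, with tied values sharing their average rank.

```python
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks starting at 1; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical values, one pair per exchange rule (NOT the data of Fig. 2).
transcripts = [0, 0, 5, 5, 8, 14, 21]
mean_length = [40, 55, 60, 90, 85, 100, 120]
print(pearson(transcripts, mean_length), spearman(transcripts, mean_length))
```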
3.1.6. Functions of proteins coded by ‘nucleotide exchange’ encrypted overlapping genes
Table 3 suggests that Blastp alignment analyses of the 702 peptides translated from the exchange transcribed human protein coding genes detect 168 previously undetected polypeptides putatively coded by exchange overlap coding. This means that 23.9% of the hypothetical translated sequences (the percentage ranges from 11.5% for A↔C exchanges to 46.2% for A↔U exchanges) are potentially protein coding. For the sake of comparison, note that for ‘regular’ overlap coding in the same sequences of that species, induced by suppressor tRNA activity, the same Blastp analyses yield 24 putative overlapping genes (36.9% of the hypothetical translated sequences from the five alternative frames for the 13 genes, see Seligmann, 2011a).
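The counts and percentages above can be checked with a few lines of arithmetic. The sketch assumes that the 702 hypothetical peptides correspond to 13 genes × 9 symmetric exchange rules × 6 reading frames; this decomposition is an inference that matches the stated total, not a statement taken from the text.

```python
from itertools import combinations

NUCLEOTIDES = "ACGU"
GENES = 13    # human mitochondrial protein coding genes
FRAMES = 6    # assumed: three frames on each strand

# Six rules swapping one pair of nucleotides ...
pair_rules = list(combinations(NUCLEOTIDES, 2))
# ... and three rules swapping two disjoint pairs at once.
double_rules = [
    (p, q) for p, q in combinations(pair_rules, 2) if not set(p) & set(q)
]
n_rules = len(pair_rules) + len(double_rules)

n_peptides = GENES * n_rules * FRAMES
print(n_rules, n_peptides)                 # 9 702
print(round(100 * 168 / n_peptides, 1))    # 23.9
print(round(100 * 24 / (GENES * 5), 1))    # 36.9
```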
According to Table 3, these putative exchange overlapping genes include alignments with 11 proteins interacting with DNA or RNA (4 for G↔U exchanges, 3 each for A↔G and A↔U+C↔G exchanges (data for the latter exchange type are not in Table 3; it is analyzed in detail by Seligmann, 2012d), and one for A↔G+C↔U exchanges). These putative overlapping genes might themselves code for protein(s) involved in the production of the exchange transcripts. This could be indicated by the positive correlation between their percentage within the sample of putative overlapping genes and observed exchange transcript numbers (r = 0.45, not statistically significant even at P < 0.20).
Fig. 2 indicates the number of such candidate overlapping genes for each type of nucleotide exchange, the number of predicted membrane proteins, and the number of proteins with functions frequently associated with typical mitochondrial metabolism. The latter are most numerous, in total 29, occur in all nucleotide exchange types (least (one) for A↔U, most (6) for G↔U), and include sequences aligning with an alkyl hyperoxide reductase for G↔U exchange
Table 3. List of GenBank proteins aligning according to Blastp with putative peptide sequences translated from mitochondrial protein coding genes assuming nucleotide exchanging transcription. Columns are: 1. gene identity and frame (1–3, + strand; 4–6, − strand); 2. first and last amino acids in alignment; 3. alignment length; 4. alignment similarity; 5. entry of GenBank protein aligning with peptide translated from exchanging transcription; 6. description of GenBank protein; 7. number of stops in alignments; 8. type of exchanging transcription.
Gene Loc N Si Id Origin Ter
ND1 1 48–180 139 45 AEQ35744 ND1 Pan troglodytes 5 G-T
ND1 3 5–98 104 41 AAB39554 Nitrate reductase Agrostemma githago 1 G-T
ND1 4 49–282 246 35 EFC39457 Hyp. Naegleria gruberi 4 G-T
ND1 6 188–290 104 49 EGH13632 Exonuclease Pseudomonas syringa 0 G-T
ND2 1 3–341 353 41 AAL48391 ND2 Homo sapiens 29 G-T
ND2 5 297–340 47 57 EFZ00925 Hyp. Metarhizium anisopliae 0 G-T
ND2 6 169–282 126 46 ABM38496 Antitermination Polaromonas naphthalenivorans 0 G-T
CO1 6 150–220 73 51 EFQ64751 Hyp. Pseudomonas fluorescens 0 G-T
CO1 6 307–460 154 43 XP002804370 Hyp. Macaca mulatta 3 G-T
CO2 3 14–107 96 42 XP003230408 Brevican core -like Anolis carolinensis 0 G-T
CO2 6 7–35 29 69 EGT47210 Hyp. Caenorhabditis brenneri 0 G-T
AT8 1 2–68 67 55 ACR09286 AT8 Homo sapiens 7 G-T
AT8 4 1–58 61 46 EGF97967 Hyp. Melampsora larici-populina 0 G-T
AT8 4 28–68 40 58 AEE71795 Lipoprotein Propionibacterium acnes 0 G-T
AT8 5 1–45 45 64 AAG44787 DC48 Homo sapiens 1 G-T
AT8 5 41–59 19 84 EEV38750 Glycosyl transferase I Enterococcus casseliflavus 0 G-T
AT8 6 20–51 32 59 BAL01249 Sodium/glutamate symporter Oscillibacter valericigenes 0 G-T
AT8 6 7–57 51 51 CAM64590 Hyp. Mycobacterium abscessus 0 G-T
AT8 6 14–59 47 55 EEY51301 Glycosyltransferase 2 Bacteroides sp. 2 1 33B 0 G-T
AT6 1 11–226 227 43 ADU77956 AT6 Homo sapiens 22 G-T
AT6 3 80–149 70 47 AF467769 Glycoprotein precursor Crimean-Congo hemorrhagic fever virus 3 G-T
AT6 5 115–170 58 52 EGV19704 Hyp. Thiocapsa marina 0 G-T
AT6 6 151–224 74 61 AAG44787 DC48 Homo sapiens 1 G-T
AT6 6 9–75 68 49 EED92930 Hyp. Thalassiosira pseudonana 0 G-T
AT6 6 115–172 59 53 CCB67068 Hyp. Hyphomicrobium sp. MC1 0 G-T
CO3 1 2–158 170 40 ADL31200 CO3 Homo sapiens 12 G-T
CO3 3 16–166 119 50 BAB93516 OK/SW-CL.16, Homo sapiens 6 G-T
CO3 5 111–169 59 49 EGU76154 Hyp. Fusarium oxysporum 1 G-T
CO3 5 103–141 39 64 EDP05187 Hyp. Chlamydomonas reinhardtii 0 G-T
ND3 4 31–95 65 43 EGW6025 Transcriptional regulator Dechlorosoma suillum 0 G-T
ND3 6 41–76 36 53 CAJ86300 H0124B04.17 Oryza sativa Indica 0 G-T
ND4l 3 39–75 37 62 EGU85298 Hyp. Fusarium oxysporum 0 G-T
ND4l 5 4–92 89 46 EEA93813 Alkyl hyperoxide reductase Pseudovibrio 1 G-T
ND4l 6 31–92 62 48 BQ76407 Diguanylate cyclase/phosphodiesterase with PAS/PAC sensor Pseudomonas putida 0 G-T
ND4 1 2–451 463 39 ADL31476 ND4 Homo sapiens 50 G-T
ND4 3 92–216 125 46 BAC5228 Hyp. Homo sapiens 6 G-T
ND4 3 137–265 121 44 AAG44628 DC24 Homo sapiens 3 G-T
ND4 5 246–368 138 42 EAT38723 Hyp. Aedes aegypti 0 G-T
ND5 1 2–581 614 43 ACU09622 ND5 Homo sapiens 50 G-T
ND5 6 188–255 71 46 ADH63862 O-Acetylhomoserine/O-acetylserine sulfhydrylase Meiothermus silvanus 1 G-T
ND6 6 112–172 66 45 CAB07382 Caenorhabditis elegans 0 G-T
CytB 6 199–276 78 46 Y86845 Serine esterase family Metarhizium acridum 0 G-T
CytB 6 229–293 67 31 ACA19730 Transcriptional regulator Methylobacterium 1 G-T
ND1 1 21–229 218 42 ACT75317 ND1 Phaeoceros laevis 4 C-T
ND1 1 229–305 80 46 EEB33262 Hyp. Desulfovibrio piger 1 C-T
ND2 6 51–93 45 56 ADQ43189 Oligopeptide transporter Eutrema parvulum 0 C-T
ND2 6 212–319 108 47 AFB2830 Hyp. Rickettsia rickettsii 6 C-T
CO1 1 11–395 398 32 ACT75318 CO1 Phaeoceros laevis 4 C-T
CO1 6 157–326 171 38 CAZ61577 CO1 Sciadicleithrum variabilum 8 C-T
CO1 6 316–504 190 41 AEI55877 CO1 Penicillium polonicum 13 C-T
CO2 1 19–124 109 39 ABG40996 Redoxin Pseudoalteromonas atlantica 1 C-T
CO2 1 66–149 90 48 EFH6873 Possible ribosomal prot. Clostridium difficile 1 C-T
CO3 1 57–250 194 39 ACS71775 CO3 Isoetes engelmannii 7 C-T
ND3 1 51–106 59 56 EFQ96882 Ankyrin repeat domain-containing Arthroderma gypseum 1 C-T
ND3 6 11–60 53 58 EFW38918 Efflux ABC transporter, permease Treponema phagedenis 0 C-T
ND4l 2 3–62 61 49 XP002121272 Hyp. Ciona intestinalis 0 C-T
ND4l 2 10–49 40 50 ADN36160 Glycosyl transferase Methanoplanus petrolearius 0 C-T
ND5 6 60–125 78 41 ABF33578 Oligohyaluronate lyase Streptococcus pyogenes 1 C-T
ND6 1 26–174 149 41 ADT82255 ND6 Hylobates muelleri 0 C-T
ND6 5 38–114 83 48 EDY73983 GA28377 Drosophila pseudoobscura 1 C-T
CytB 1 15–292 281 41 ACI01099 Apocytochrome b Isoetes engelmannii 2 C-T
ND1 1 1–317 317 58 CAA66304 ND1 Pongo pygmaeus 20 A-G
ND1 2 161–224 69 46 EEU48753 Hyp. Nectria haematococca 1 A-G
ND1 2 28–117 109 42 BAJ78673 RNA polymerase II largest subunit Bemisia tabaci 2 A-G
ND1 3 85–197 113 41 BAE91117 Macaca fascicularis 2 A-G
ND2 1 1–347 347 57 AEL64185 ND2 Homo sapiens 17 A-G
ND2 2 187–286 138 34 CAF93389 Tetraodon nigroviridis 3 A-G
ND2 3 45–208 182 37 ADZ45521 GREBP cGMP-response element-binding Homo sapiens 1 A-G
ND2 6 119–333 229 40 XP002611624 Hyp. Branchiostoma floridae 13 A-G
CO1 1 7–505 499 54 CAC37979 Co1 Macaca sylvanus 23 A-G
CO1 3 337–440 104 41 XP002801723 Hyp. Macaca sylvanus 2 A-G
CO1 4 83–193 111 47 EAL47953 DENN domain protein Entamoeba histolytica 1 A-G
Table 3 (Continued)
Gene Loc N Si Id Origin Ter
CO2 1 6–227 222 55 ABB78341 CO2 Homo sapiens 15 A-G
CO2 2 32–143 116 37 ABL11420 Ammonia monooxygenase B uncultured crenarchaeote 3 A-G
CO2 3 7–73 72 39 EGI53545 FAD dependent oxidoreductase Sphingomonas 1 A-G
AT8 1 1–68 68 60 AEQ36342 AT8 Pan paniscus 3 A-G
AT8 2 10–54 51 41 CAA04176 DNA gyrase B subunit Myxococcus xanthus 1 A-G
AT8 3 11–57 48 46 ACB74250 N-acetylmuramyl-l-alanine amidase, negative regulator of AmpC, AmpD Opitutus terrae 1 A-G
AT8 3 1–42 42 50 XP003199307 KH domain-containing, RNA-binding, signal transduction-associated Danio rerio 0 A-G
AT8 4 2–67 66 42 XP003500291 Interferon-activable protein 203-like Cricetulus griseus 2 A-G
AT8 6 12–56 45 56 EEQ30817 Hyp. Arthroderma otae 2 A-G
AT6 1 4–225 222 62 ADG46521 AT6 Homo sapiens 5 A-G
AT6 2 7–108 102 39 XP535203 Nucleolar GTP-binding protein 1 Canis lupus familiaris 5 A-G
AT6 3 44–104 62 47 ABZ06095 Hyp. uncultured marine microorganism 0 A-G
AT6 3 146–223 83 42 ACX89269 Type VI secretion system Vgr Pectobacterium wasabiae 1 A-G
CO3 1 1–256 256 54 ABU64439 CO2 Homo sapiens 17 A-G
CO3 3 15–148 134 48 BAB93516 OK/SW-CL.16 Homo sapiens 0 A-G
CO3 5 11–86 77 47 AEQ53502 Potassium efflux system KefA protein/small-conductance mechanosensitive channel Pelagibacterium halotolerans 1 A-G
ND3 1 1–102 102 69 AEQ35815 ND3 Pan troglodytes 5 A-G
ND3 2 2–31 30 57 AAA33306 Regulatory protein Emericella nidulans 0 A-G
ND3 2 23–74 52 44 EGT79756 4-Alpha-glucanotransferase Haemophilus haemolyticus 0 A-G
ND3 3 67–103 37 51 GAA30488 Zinc finger protein Clonorchis sinensis 0 A-G
ND3 5 1–101 104 43 EAW14848 Histone acetylase complex Aspergillus clavatus 5 A-G
ND3 6 52–81 30 70 ACZ10040 Hyp. Sebaldella termitidis 0 A-G
ND4l 1 1–97 97 65 ADT82590 ND4l Nomascus siki 2 A-G
ND4l 2 30–67 39 59 EGC44959 Peptidyl-prolyl cis-trans isomerase Ajellomyces capsulatus 1 A-G
ND4l 2 2–98 100 39 EEE79199 Hyp. Populus trichocarpa 3 A-G
ND4l 3 1–25 58 52 EGZ30743 Hyp. Phytophthora sojae 0 A-G
ND4l 3 24–89 66 45 EET89248 Phosphoketolase Clostridium carboxidivorans 1 A-G
ND4l 4 12–97 87 43 EHA98470 Sodium/potassium/calcium exchanger 1 Heterocephalus glaber 2 A-G
ND4 1 1–459 459 58 ABU47843 ND4 Pan troglodytes 22 A-G
ND4 3 100–216 117 43 BAC85228 Homo sapiens 1 A-G
ND4 3 135–214 80 46 AAG44628 DC24 Homo sapiens 1 A-G
ND4 3 61–108 48 52 XP003388243 Mitochondrial import inner membrane translocase Tim22-like Amphimedon queenslandica 0 A-G
ND5 1 16–602 587 55 CAR95863 ND5 Homo sapiens 21 A-G
ND6 1 5–173 169 54 CAA77005 ND6 Papio hamadryas 15 A-G
ND6 6 82–129 51 61 EDP00960 Flagella associated membrane Chlamydomonas reinhardtii 0 A-G
CytB 1 1–372 372 57 ABU67123 CytB Homo sapiens 14 A-G
CytB 3 239–371 134 40 BAB12147 Hyp. Macaca fascicularis 1 A-G
ND1 2 21–319 320 40 EDL19272 Zonadhesin Mus musculus 7 A-T
CO1 1 23–228 210 44 AAY22220 CO1 Macaca nemestrina 5 A-T
CO2 1 68–129 62 48 ACK71749 GCN5-related N-acetyltransferase Cyanothece 2 A-T
CO2 2 67–131 65 55 EEE25729 Hyp. Toxoplasma gondii 0 A-T
CO2 3 3–76 75 51 EDQ71631 Hyp. Physcomitrella patens 0 A-T
AT8 1 14–69 59 51 EAY11801 Hyp. Trichomonas vaginalis 1 A-T
CO3 2 77–139 83 46 ADE84631 Arginine exporter Rhodobacter capsulatus 0 A-T
ND3 2 20–85 66 47 NP001186517 Adenosine monophosphate deaminase Gallus gallus 0 A-T
ND3 2 4–73 72 46 EFQ33388 Hyp. Glomerella graminicola 0 A-T
ND3 5 42–78 40 60 EDS34822 Hyp. Culex quinquefasciatus 0 A-T
ND4l 5 22–44 24 54 EFM59912 Efflux transporter Brucella 1 A-T
ND4 3 53–201 150 42 XP001503289 Synaptopodin-2 Equus caballus 5 A-T
CytB 1 79–151 91 49 EAL66426 Alpha adducin Dictyostelium discoideum 3 A-T
CytB 1 8–141 136 38 XP002733786 Hyp. Saccoglossus kowalevskii 2 A-T
ND1 3 191–266 76 55 ACU14018 Hyp. Glycine max 1 C-G
ND1 5 203–266 64 58 EFH42847 Photosystem I reaction center subunit psi-N Arabidopsis lyrata 1 C-G
ND1 5 49–126 78 42 CAN60489 Hyp. Vitis vinifera 1 C-G
CO1 1 237–395 159 47 AAX37529 CO1 Dolichopoda euxina 12 C-G
AT8 5 18–67 50 50 CCD46646 Hyp. Botryotinia fuckeliana 1 C-G
AT6 5 120–171 52 54 ZP00052904 Glutamate decarboxylase Magnetospirillum magnetotacticum 0 C-G
CO3 5 136–211 76 41 ZP04750338 Short chain dehydrogenase Mycobacterium kansasii 1 C-G
CO3 5 146–228 84 40 ABL93860 Short-chain dehydrogenase/reductase SDR Mycobacterium 1 C-G
ND3 6 13–71 62 55 EGC31827 Hyp. Dictyostelium purpureum 0 C-G
ND3 6 2–46 48 58 EHB94143 NLP/P60 Pseudoxanthomonas spadix 0 C-G
ND4l 1 3–97 95 58 ADZ37133 ND4l Rhinopithecus avunculus 6 C-G
ND4l 4 39–74 37 59 EED89742 Hyp. Thalassiosira pseudonana 1 C-G
ND4 3 103–205 103 56 BAC85228 Homo sapiens 0 C-G
ND4 6 81–159 79 43 BAD92431 UDP-Gal:betaGlcNAc beta 1,4-galactosyltransferase 6 variant Homo sapiens 0 C-G
ND5 1 101–526 426 48 ABB97838 ND5 Homo sapiens 34 C-G
ND6 1 1–174 174 55 ADO19968 ND6 Homo sapiens 2 C-G
ND6 3 9–105 101 46 EFX75312 Hyp. Daphnia pulex 3 C-G
CO1 1 15–477 463 45 ACY39526 CO1 Tetropium fuscum 0 A-C
CO2 1 100–227 128 39 ACM71926 CO2 Homo sapiens 0 A-C
CO2 1 2–58 57 51 CCA59686 Hyp. Streptomyces venezuelae 0 A-C
CO2 1 143–175 35 66 ZP10139372 Polyketide synthase Fluoribacter dumoffii 0 A-C
CO2 3 28–64 37 65 EGT86488 MmpL4 7 Mycobacterium colombiense 1 A-C
AT8 1 29–58 32 63 EAU75829 AGAP012039-PA Anopheles gambiae 0 A-C
AT8 3 8–58 51 61 EAA32833 Nitrate reductase Neurospora crassa 1 A-C
CO3 1 10–260 253 43 ABU64439 CO3 Homo sapiens 9 A-C
ND3 3 66–101 37 59 CCB89733 Hyp. Simkania negevensis 0 A-C
ND4l 3 33–95 63 56 CAG82479 YALI0C22946p Yarrowia lipolytica 1 A-C
ND5 1 19–468 455 39 ADH8404 ND5 Sinogastromyzon sichangensis 16 A-C
ND6 1 1–170 170 61 AFF86056 ND6 Homo sapiens 1 A-C
ND6 2 91–172 83 46 EDX74154 Hyp. Microcoleus chthonoplastes 1 A-C
ND6 2 41–128 88 45 CAG99131 KLLA0E02135p Kluyveromyces lactis 0 A-C
CytB 1 30–351 324 43 CAP58774 CytB Cobitis sinensis 9 A-C
ND1 4 165–256 110 44 XP002807608 Adenylate cyclase type 3 Callithrix jacchus 3 ACGT
ND1 5 54–99 49 61 CBJ30960 Ectocarpus siliculosus 0
CO2 6 90–139 51 59 ZP08847098 Pectate lyase Anaerophaga thermohalophila 0
AT8 2 18–69 54 56 BAD93095 TEA domain family member 1 Homo sapiens 1
25–69 51 51 XP003582797 Leucine-rich repeat transmembrane Bos taurus 1
AT8 3 20–56 37 57 AAC49419 Repellent protein Ustilago maydis 2
AT8 4 19–49 31 55 AAO43936 Ca2+ homeostasis Arabidopsis thaliana 1
AT8 6 15–54 40 58 ACB76837 Multi-sensor hybrid histidine kinase Opitutus terrae 0
ND3 4 23–70 48 56 ZP06460666 Major facilitator transporter Pseudomonas syringae 0
ND4l 5 5–46 42 64 CBK22878 ABC transporter type 1 Blastocystis hominis 1
ND4 5 104–186 85 52 EAQ38580 Hyp. Dokdonia donghaensis 1
ND6 1 88–120 33 48 AES66793 Cysteine-rich receptor-like kinase Medicago truncatula 0
CytB 5 139–199 61 61 CAI76264 Proton translocating inorganic pyrophosphatase Theileria annulata 1
CytB 6 260–319 60 47 EGI65165 Purity of essence Acromyrmex echinatior 0
ND1 3 86–149 74 41 ACR37944 Zea mays 1 AGCT
ND2 2 264–344 88 51 EAZ06402 Hyp. Oryza sativa 0
CO1 1 173–218 46 57 GAB46836 Sodium/sulphate symporter Gordonia terrae 0
CO1 6 275–356 82 45 CBY23304 Oikopleura dioica 1
CO2 2 41–99 61 54 ADE11632 Diguanylate cyclase/phosphodiesterase with PAS/PAC sensor(s) Sideroxydans lithotrophicus 0
CO2 3 14–65 58 43 EHL87055 Aspartate-ammonia ligase Tannerella 0
AT8 1 4–48 46 57 EHI12376 Regulator Mycobacterium thermoresistibile 0
AT8 4 1–47 50 42 EAQ88006 Hyp. Chaetomium globosum 1
AT8 5 9–42 34 56 EFK66562 ABC transporter ATP-binding Streptomyces 1
AT8 6 39–68 32 66 EEN61262 Hyp. Branchiostoma floridae 0
AT6 2 117–151 35 60 AAH73232 MGC80562 Xenopus laevis 0
AT6 3 95–178 103 38 EGO64651 DEAD 2 domain protein Acetonema longum 1
17–71 56 48 CCC14912 Sordaria macrospora 0
AT6 6 60–119 115 45 EEW37665 HMP/thiamine-binding Granulicatella adiacens 3
CO3 6 111–161 56 45 XP002826409 Cardiotrophin-2-like Pongo abelii 1
ND3 6 4–42 39 51 ACAZ21358 Oxidoreductase Sanguibacter keddieii 0
36–63 28 64 CCG92029 Phosphoenolpyruvate carboxylase Methylacidiphilum fumariolicum 0
ND4l 3 18–47 32 72 EGC17145 Transcriptional regulator Thiocapsa marina 0
47–79 33 55 EFV05470 Beta-glucosidase Prevotella salivae 1
ND4l 5 15–93 83 49 EEC06142 Dihydrodipicolinate synthase Ixodes scapularis 3
ND6 5 41–151 110 42 ZP09248514 Ammonium transporter Acaryochloris 0
ND6 6 29–120 97 48 ACQ69531 7TM receptor with intracellular metal dependent phosphohydrolase Exiguobacterium 1
81–131 51 51 AEA47822 Phosphoesterase RecJ domain Archaeoglobus veneficus 0
53–133 81 48 AFI46966 Drug resistance transporter Pasteurella multocida 0
coding, a redoxin for C↔U exchange coding, a FAD dependent oxidoreductase for A↔G exchange coding, and a short chain dehydrogenase for C↔G exchange coding. A detailed discussion of each case would not be constructive at this preliminary stage of exploration of nucleotide exchange coding. However, the distribution of functions does not seem random, especially in relation to functions typically associated with mitochondrial metabolism, including for nucleotide exchanges for which no or few transcripts were found in Table 2. Hence protein alignment data suggest that one cannot exclude the occurrence of any type of nucleotide exchange, though some seem more frequent than others.
All regular mitochondrial main frame-encoded proteins are membrane proteins, and these are also frequent among the alignment data in Table 3 (25 cases). These include numerous transporters and symporters, for example the mitochondrial import inner membrane translocase Tim22-like for the A↔G nucleotide exchange. Here again, the data at hand suggest protein functions that seem non-random in relation to known mitochondrial functions in the cell’s metabolism. Note that for the A↔C+G↔U exchange, a type of nucleotide exchange for which no transcript was detected, alignments with membrane proteins were most numerous (7), while no alignments with proteins interacting with DNA or RNA were found, and only few (2) with physiological functions associated with mitochondrial metabolism. This would suggest that this type of RNA recoding by nucleotide exchange specifically produces membrane-bound proteins, under the apparently rare conditions that induce this type of transcriptional nucleotide exchange. This confirms that at this preliminary stage, no type of nucleotide exchange should be excluded, even if RNA alignment data in Table 2 are non-existent for that type of nucleotide exchange. It is indeed plausible that each type of nucleotide exchange is induced by specific, perhaps stress- and/or ontogeny-associated conditions, and some might be rarer than others. Further bioinformatics analyses of the putative nucleotide exchange overlap coding sequences yield clues in this respect.
3.1.7. Origins of proteins in Table 3
The distribution of the proteins in Table 3 among broad systematic groups is informative to some extent. Table 3 includes only one alignment with a protein of viral origin and one with a protein of archaeal origin (0.6% each). Most common were bacterial origins (41.5%), followed by metazoan (25.7%), fungal (17.5%) and ‘vegetal’ (from Viridiplantae) origins (11.11%). Fig. 3 compares these with the relative representation of proteins from these respective origins in GenBank’s database. This clearly shows that eukaryotic origins (organisms that contain mitochondria) are overrepresented in Table 3, especially metazoan origins. This general ranking and pattern varied little between different types of nucleotide exchange codings, though for A↔G+A↔U, metazoan origins were more common than bacterial origins, and for G↔U, proteins from Viridiplantae were more frequent than from Metazoa. However, these variations might be stochastic and are small in relation to the overall pattern found when pooling all data from Table 3 and presented in Fig. 3.
Fig. 3. Relative proportion of proteins from broad systematic evolutionary classes of organisms and viruses in Table 3 as a function of their proportion in GenBank’s protein database. Proteins from eukaryotic origins (especially metazoan) are overrepresented in Table 3. The line indicates x = y.
Considering that the sequences analyzed are of metazoan origin (Homo sapiens), this suggests that overlap coding through nucleotide exchange transcription is usually not due to horizontal transfers (including from viruses), but probably evolves gradually within phylogenetic groups, and that occasionally the genome will include a gene coding for that protein in the regular (non-exchange) form. This might result from occasional reverse transcription of nucleotide exchange transcripts and their integration in the nuclear genome of organisms that possess mitochondria. This phenomenon would be compatible with the positive correlation between detected protein alignment lengths and exchange RNA transcripts in Fig. 2. If this is the mechanism underlying the integration of a gene coding without nucleotide exchange for a protein usually coded by mitochondrial nucleotide exchange transcription, the fact that viruses are underrepresented in Table 3 suggests that proteins coded by nucleotide exchange are probably proteins with some adaptive physiological function. This agrees with the fact that Table 3 includes numerous proteins that seem adequate for mitochondrial metabolism.
Along that rationale, the organism indicated in Table 2 in which the protein is coded directly by DNA, without nucleotide exchange, and whose protein aligns with the human mitochondrial sequence translated after nucleotide exchange, could be an organism where the physiological function of the protein coded by nucleotide exchange became more frequently required, justifying the inclusion of a gene that explicitly (i.e. without nucleotide exchange) codes for that protein. The methods used here would not detect any nucleotide exchange-encoded protein coding gene if this did not occur occasionally. It is probable that not all actual nucleotide exchange encoded genes have been detected by this method, because direct integration of nucleotide exchange coding contents into the genome might not yet have occurred for all nucleotide exchange-encrypted genes, and because GenBank may not include sequences of organisms where this has occurred. Hence it is very likely that Table 3 underestimates numbers of nucleotide exchange encoded genes.
Fig. 4. Mean number of stops per putative protein coding region as detected by Blastp for nucleotide exchange recoded RNAs of human mitochondrial protein coding genes (from Table 3) as a function of the number of exchanging transcripts according to that exchange rule (from Table 2). Relatively common nucleotide exchange transcripts tend to include more stops, indicating that protein expression is limited by the fact that transcription frequency is counterbalanced by the presence of stop codons necessitating translational activity by suppressor tRNAs.
The difference between pro- and eukaryotic origins could have an alternative explanation: that exchange transcription and coding is rarer in prokaryotes. Though this possibility exists, it is not very likely, especially as the genome analyzed here is mitochondrial and probably reflects its prokaryotic ancestor. The evolutionary scenario for overlapping genes is a more probable explanation for the overrepresentation of proteins from eukaryotic origins in Table 3.
3.1.8. Stop codons in putative overlap coding genes
Table 3 indicates stop codon numbers within putative overlap coding gene sequences. Considering the density of stops within these sequences across genes, separately for each type of nucleotide exchange, stops are much less frequent within putative overlap coding regions than within the rest of the genes after nucleotide exchange. This ranges from 2.26 times less frequent for putative proteins coded by C↔U exchange transcription, to 8.71 times less frequent for those coded by A↔C exchange transcription. The former is the nucleotide exchange type represented by the least frequent, the latter by the most frequent, RNA transcript data in Table 2. Hence it seems that stops within putative overlap coding sequences according to exchange transcription modulate the expression of overlapping genes associated with each exchange transcription type, by constraining this expression to conditions where suppressor tRNA activity occurs. Indeed, Fig. 4 shows that the number of stop codons per overlapping gene (from Table 3) increases with the number of exchange transcripts observed for that type of exchange transcription (from Table 2): r = 0.747, P = 0.0104; rs = 0.854, P = 0.0078, one tailed tests. This suggests that the expression of proteins coded by nucleotide exchanges is finely tuned by two regulatory mechanisms, one positive, the frequency of specific nucleotide exchange transcription and the associated frequency of transcripts it produces (from Fig. 2), and one negative, the frequency of stops making the expression of the proteins dependent on the joint activities of nucleotide exchange transcription and suppressor tRNAs, both probably rare events.
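The recoding step behind these stop counts can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used for Table 3: the function names `exchange` and `count_stops` are hypothetical, sequences are assumed to be RNA strings over A, C, G and U, and stops are assumed to be the four stop codons of the vertebrate mitochondrial genetic code (UAA, UAG, AGA, AGG).

```python
# Assumption: stop codons of the vertebrate mitochondrial genetic code, RNA alphabet.
MITO_STOPS = frozenset({"UAA", "UAG", "AGA", "AGG"})


def exchange(seq, a, b):
    """Apply the symmetric exchange a<->b: every a becomes b and every b becomes a."""
    return seq.translate(str.maketrans(a + b, b + a))


def count_stops(seq, frame=0, stops=MITO_STOPS):
    """Count stop codons among the complete codons read from offset `frame` (0, 1 or 2)."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return sum(codon in stops for codon in codons)


# Toy sequence (not mitochondrial data): two stops before, none after A<->G exchange.
toy = "AUGUAAAGA"
stops_before = count_stops(toy)
stops_after = count_stops(exchange(toy, "A", "G"))
```

A double exchange such as A↔C+G↔U can be obtained by applying `exchange` twice with the two disjoint pairs. Dividing a stop count by the number of codons read gives a stop density that can be compared between a putative overlap coding region and the rest of the gene.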
The close match between transcript frequencies and mean numbers of stops per putative protein coding gene (Fig. 4) suggests that the system is finely tuned so that the two regulatory forces, one positive, one negative, balance each other: translation of transcripts produced by rare types of nucleotide exchange transcription is relatively unhindered by stop signals, while relatively frequent exchange transcription is constrained by numerous stops. This suggests that expression levels of proteins associated with the various types of nucleotide exchanges are regulated so as to be relatively equal, despite differences in the frequencies at which the different types of nucleotide exchange transcription apparently occur. This highly structured pattern between a positive and a negative regulatory mechanism is a strong indication that the analyses reveal real biological coding phenomena with important, even if probably rare, physiological functions. Hence it seems that the expression of genes associated with frequent types of exchange transcription depends on a further condition, that of suppressor tRNA activity. The pattern in Fig. 4 is so strong that it suggests an adaptive component to this, where suppressor tRNA activity downregulates these types of nucleotide exchanges, especially if transcripts are frequent. This result is a further indication that such transcription and expression is a physiological reality in very specific and unknown conditions.
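The two coefficients reported for Fig. 4 (Pearson's r and Spearman's rs) can be computed as in the following Python sketch. The helper names `pearson`, `ranks` and `spearman` are illustrative, no data from Tables 2 or 3 are reproduced, and P values are not computed here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Applied to one point per exchange type, with transcript numbers as x and mean stops per putative gene as y, `pearson` and `spearman` would correspond to the r and rs statistics quoted above.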
3.1.9. Deamination along replicational gradients in genomic single strandedness and nucleotide exchange overlapping genes in human mitochondrial protein coding genes
Single stranded DNA is very mutable, as compared to duplex DNA. This situation occurs when DNA is replicated, and when RNA is transcribed. Mitochondrial DNA replication is unidirectional, involving a heavy strand and a light strand replication origin (OH and OL). Distances of sites in relation to each OH and OL determine the duration sites remain single stranded during replication (see for example Krishnan et al., 2004a,b; Seligmann et al., 2006; Seligmann, 2008, 2011b).
In the single stranded state, hydrolytic deaminations A→G and C→T are most frequent (note that in this context, A→G and C→T are spontaneous mutations occurring at the DNA level, not systematic nucleotide exchanges during RNA transcription). Replicational single strandedness creates gradients in mitochondrial genome nucleotide contents that reflect these spontaneous mutations (partial review in Seligmann, 2012a). Their effect on nucleotide contents is counterbalanced by functional constraints when the nucleotide has crucial coding functions at the protein level, and hence nucleotide contents at second codon positions barely reflect the mutational single stranded gradients. However, these gradients are detectable at third codon positions; the situation is intermediate for first codon positions (Seligmann et al., 2006).
Analysing nucleotide contents at third codon positions of the regular human main frame protein coding genes in relation to overlapping genes predicted by Blastp analyses assuming suppressor tRNA activity confirmed that these regions are involved in overlap coding, as they fit deamination gradients less well than the adjacent regions that are not involved in overlap protein coding (Seligmann, 2012a). This method also confirmed overlapping genes coded by codons of four nucleotides (tetragenes coded by tetracodons, Seligmann, 2001) and protein coding genes coded in the 3′-to-5′ direction of mitochondrial sequences (Seligmann, 2012). This method is used here to confirm the existence of the putative overlapping genes associated with exchange transcription and predicted by Blastp (Table 3), for each of the nine symmetric nucleotide exchange types. Only analyses for two nucleotide exchange types are presented here, though such detailed analyses were done for each of the nine types of nucleotide exchange.
Fig. 5. A/(A+G) nucleotide ratios at third codon positions (light strand DNA) in human mitochondrial protein coding genes as a function of the time spent single stranded during replication by that gene. The base ratios reflect the C→T deamination that occurs on heavy strand DNA during replication, until the complementary lagging (light) strand is polymerized. Filled datapoints are for predicted overlap coding regions after A↔C nucleotide exchange transcription, as presented in Table 2. Hollow datapoints are for the rest of the gene, not predicted to be involved in overlap coding after A↔C exchange transcription. Both datasets fit the same predicted deamination gradient well, which suggests the putative overlap coding genes are not functional. Functionality would imply that overlap coding regions cannot mutate according to the replicational gradients, in order to preserve coding properties, and should hence not fit the gradient observed for other genome regions. This lack of difference is compatible with the fact that no RNA transcripts fitting human mitochondrial genes have been found in GenBank for nucleotide exchange rule A↔C (Table 2), and that A↔C exchange transcription is predicted to include few overlap coding genes according to Table 3.
The test of the replicational deamination gradient expects that if a region functions as an overlapping gene, it does not fit the deamination gradient well. However, if candidate overlapping genes are not expected to be frequently expressed and hence are not expected to be functional, these sequences should fit well within the replicational deamination gradient observed for other regions, which are not expected to function as overlapping genes and are involved only in regular main frame coding. The overlapping genes coded by A↔C nucleotide exchange transcription are expected to be the least functional ones according to both criteria available at this point: no RNA transcripts were detected fitting this type of exchange transcription, and Blastp analyses of hypothetical protein sequences translated from these exchange transcripts yield the lowest number of alignments with proteins existing in GenBank. No RNA transcripts fitting predictions from nucleotide exchange transcription types along the rules A↔C+G↔U and A↔G+C↔U were detected, but RNAs transcribed along these exchange transcription rules apparently code for numerous proteins, according to Table 3. Hence their situation is much less clear: they might be more functional than indicated by transcript numbers in Table 2.
Fig. 5 plots the nucleotide contents A/(A+G) ratio at the third codon position (according to regular main frame codons of the regular protein coding gene) for human mitochondrial protein coding genes as a function of the duration spent single stranded during replication by that gene, separately for putative overlapping genes (filled symbols) and for the rest of the genes (open symbols) for A↔C nucleotide exchange transcripts. Note that the nucleotide ratios are for the regular human mitochondrial DNA contents of the light strand, which encodes most of the regular human mitochondrial protein coding genes, and not after A↔C exchange.
Fig. 6. A/(A+G) nucleotide ratios at third codon positions (light strand DNA) in human mitochondrial protein coding genes as a function of the time spent single stranded during replication by that gene for overlapping genes coded by G↔U exchange transcription. Filled datapoints are for predicted overlap coding regions after G↔U nucleotide exchange transcription, as presented in Table 2. Hollow datapoints are for the rest of the gene, not predicted to be involved in overlap coding after G↔U exchange transcription. The latter fit the predicted deamination gradient well, but the former much less, as expected if overlap coding genes were functional. This fits with the fact that Tables 1 and 2 include numerous G↔U exchange transcripts and predicted overlapping genes, respectively. This pattern contrasts with the one observed in Fig. 5, which jointly confirms that the test’s results reflect overlapping gene functionalities.
The replicational gradient in light strand A/(A+G) reflects the increase in T after C→T (deamination) mutations in single stranded heavy strand DNA (Krishnan et al., 2004a,b). The correlation for putative overlapping genes is r = 0.50. This Pearson correlation coefficient is stronger than that for the rest of the genome (r = 0.43). This is the opposite of what would be expected for overlap coding genes, whose data should not fit a gradient. In addition, if one excludes from analyses the outlier datapoint for the non-overlap coding region of the gene AT8, which is based on very few codons because most of that gene is predicted to be involved in overlap coding, the regression lines for both putative overlap and non-overlap coding regions become almost identical (dashed line in Fig. 5). This means that the deamination gradient test does not confirm the predicted overlap coding status for A↔C exchange transcripts. These regions behave exactly along the predictions of the deamination gradients, as do the other regions of the same genes. In other terms, deamination mutations occur according to the same rules in these regions as in regions not expected to be involved in overlap coding after A↔C exchanges.
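As an illustration of the quantity plotted in Figs. 5 and 6, the following Python sketch computes a nucleotide ratio such as A/(A+G) at third codon positions of a coding region. It assumes an in-frame DNA string that starts at a first codon position; the function name `third_position_ratio` is hypothetical and the toy sequence is not taken from the human mitochondrial genome.

```python
def third_position_ratio(seq, numerator="A", denominator="AG"):
    """Ratio such as A/(A+G) restricted to third codon positions of an in-frame sequence."""
    third = seq.upper()[2::3]  # every third nucleotide, starting at codon position 3
    total = sum(third.count(base) for base in denominator)
    if total == 0:
        return float("nan")  # ratio undefined when neither base occurs
    return third.count(numerator) / total


# Toy region: third positions are A, G, T, A, so A/(A+G) = 2/3.
toy_region = "GCAGCGGCTGCA"
a_ratio = third_position_ratio(toy_region)
c_ratio = third_position_ratio(toy_region, numerator="C", denominator="CT")
```

Computing this ratio separately for the predicted overlap coding region and for the rest of each gene, then correlating each set with the time spent single stranded, corresponds to the comparison described above.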
At the other extreme, G↔U exchange transcribed genes are predicted to include the largest number of overlapping protein coding genes, and several RNA transcripts were detected in GenBank matching G↔U exchange transcription of the human mitochondrial genome. Fig. 6 presents the replicational deamination gradient analysis of the same human mitochondrial protein coding genes as in Fig. 5, but separating putative overlapping genes from the rest of the gene for overlap coding predicted for G↔U (and not A↔C) transcribed genes. The gradient is clear for regions not predicted to be involved in overlap coding (r = 0.55, one tailed P = 0.0398), but weaker for predicted overlap coding regions (r = 0.43, one tailed P = 0.0548). This situation fits what is predicted if overlap coding genes are functional.
Similar analyses for the other nucleotide exchange types yield qualitatively similar results (deamination gradient weaker for predicted overlap coding regions than for other regions) in all the remaining seven types of symmetric nucleotide exchanges. Hence qualitatively, only for A↔C nucleotide exchanges do gradient analyses not fit predictions that overlapping genes are functional (analysis presented in Fig. 5). This functions as a kind of negative positive control (negative because overlapping and other regions behave similarly, positive because the null hypothesis expects the detection of a deamination gradient). According to a one tailed sign test, the probability of obtaining by chance the qualitative result expected if predicted overlap coding genes are functional (that the gradient should be weaker for predicted overlapping regions than for other regions) 8 among 9 times is P = 0.0098. Hence the replication gradient analyses apparently confirm that overlap coding according to nucleotide exchange transcription predicted by Blastp analyses (Table 3) is generally not an artifact.
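A sign test of this kind reduces to a binomial tail probability, which the following minimal Python sketch computes. It assumes the textbook one-tailed definition, P(X ≥ k) for X ~ Binomial(n, 1/2); the function name `sign_test_upper_tail` is illustrative and not taken from the original analysis. Under this definition, 8 cases among 9 give 10/512 ≈ 0.0195; the value quoted in the text (0.0098) equals half of that tail, so it may follow a different convention from the one assumed here.

```python
from math import comb


def sign_test_upper_tail(k, n):
    """One-tailed sign test: probability of at least k successes in n fair trials."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


# 8 of 9 exchange types showing the weaker gradient in predicted overlap regions.
p_eight_of_nine = sign_test_upper_tail(8, 9)
```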
The only type of nucleotide exchange transcription for which comparisons between deamination gradients observed for predicted overlap coding regions and other regions do not confirm the functionality of the overlapping genes is the nucleotide exchange type that, according to other analyses (in Tables 1 and 2), is the least likely to occur. This also strengthens the working hypothesis, as well as the adequacy of deamination gradient analyses as a test for functionality of predicted mitochondrion-encoded overlapping genes.
One obtains qualitatively similar results when performing these gradient analyses for C/(C+T) nucleotide contents at third codon positions. In that case, the replicational gradient is stronger in regular regions than in predicted overlap coding regions for seven among nine types of nucleotide exchanges, which is also a statistically significant majority of cases according to a one sided sign test (P = 0.0449). While weaker, this result nevertheless confirms overlap coding status for most predicted candidate overlap coding regions and most types of nucleotide exchanges.
3.2. Circular code analyses confirm overlap coding
The possibility that nucleotide exchange transcription increases the coding potential of genes could be a major discovery, but at this point, evidence on proteins translated from the predicted overlapping genes is still totally missing. For that reason, an additional computational test is used to strengthen the status of the predicted overlap coding genes presented in Table 3. This test uses a theoretical background that totally differs from the deamination gradient analyses presented in the previous section, is based on different information and sequence properties, and hence is independent of the deamination gradient test.
Empirical observations have shown that some codons are overrepresented
in overlapping genes, as compared to regular genes, while other codons
are underrepresented (Ahmed et al., 2007, 2010; Ahmed and Michel,
2011). The overrepresented codons are the homopolymer codons AAA, CCC,
GGG and UUU. The underrepresented ones form a circular code (Arqués and
Michel, 1996, 1997; Michel, 2008; Ahmed and Michel, 2011; Gonzalez et
al., 2011). The reasons for that remain unclear. They might have to do
with ribosomal frame-maintenance, as parts of ribosomal RNA involved in
interacting with the mRNA also form circular codes (Michel, 2012). From
an empirical point of view, this pattern makes it possible to test
whether codon usages in predicted overlap coding genes are indeed
optimized along the lines of avoiding circular code codons and
preferring homogeneous codons.
Independently of the interesting theoretical underpinnings of the links
between circular codes and overlap coding, circular codes can be used
to test whether codon usages in predicted overlap coding regions fit
the pattern reported for overlapping genes. In that context, I analyzed
each of the three frames of each human protein coding gene in relation
to each set of 20 ‘circular code codons’ (one set is associated with
each frame), separating predicted overlap coding regions (according to
Table 3) from other regions, for each of the nine nucleotide exchange
types. Homogeneous codons were scored −1, and codons belonging to the
circular code for that frame were scored 1. These scores were averaged
and compared between putative overlap coding regions and other regions.
Lower scores are expected for overlapping genes (as predicted by
analyses presented in Table 2) than for other regions if homogeneous
codons are overrepresented and circular code codons underrepresented
within the predicted overlapping gene as compared to regular coding
regions.
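
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: codons that are neither homogeneous nor in the circular code are assumed to score 0 (the text does not say), the function names are hypothetical, and the set `TOY_CODE` is a placeholder, not one of the actual 20-codon circular codes.

```python
# Homogeneous (homopolymer) codons, written in the DNA alphabet.
HOMOGENEOUS = {"AAA", "CCC", "GGG", "TTT"}

def split_codons(seq, frame=0):
    """Split a sequence into complete codons, starting at offset `frame` (0, 1 or 2)."""
    s = seq.upper().replace("U", "T")
    return [s[i:i + 3] for i in range(frame, len(s) - 2, 3)]

def mean_circular_code_score(seq, circular_code, frame=0):
    """Mean score of a region: -1 per homogeneous codon, +1 per circular
    code codon, 0 otherwise (the 0 is an assumption of this sketch)."""
    code = {c.upper().replace("U", "T") for c in circular_code}
    codons = split_codons(seq, frame)
    if not codons:
        raise ValueError("region contains no complete codon")
    total = 0
    for codon in codons:
        if codon in HOMOGENEOUS:
            total -= 1
        elif codon in code:
            total += 1
    return total / len(codons)

# Placeholder set for illustration only, NOT a real circular code.
TOY_CODE = {"GAT", "CTG"}
overlap_score = mean_circular_code_score("AAAGGGCTG", TOY_CODE)  # AAA, GGG, CTG
regular_score = mean_circular_code_score("GATCTGAAA", TOY_CODE)  # GAT, CTG, AAA
```

With the toy inputs, the region rich in homogeneous codons receives the lower mean score, which is the direction expected for overlap coding regions.
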
Mean scores for the two types of regions could be compared by t-tests,
but here I restrict analyses to the statistically robust nonparametric
sign test. Comparing the mean score obtained for each putative overlap
coding region with the mean score of the rest of that gene, I tested
whether the score was lower in the overlap coding region significantly
more often than the 50% expected if no pattern exists in the data. This
yields three nucleotide exchange types with P < 0.05 according to one
tailed sign tests: A↔U (14 positive results among 21 comparisons,
P = 0.0473); A↔G+C↔U (27 positive among 39 comparisons, P = 0.0059);
and A↔U+C↔G (37 positive among 45 comparisons, P = 0.0000014). Note
that several transcripts have been detected for two of these types of
nucleotide exchanges (A↔U, in Table 2; and A↔U+C↔G, Seligmann, 2012d).
Though Table 2 does not include any transcript fitting A↔G+C↔U exchange
transcription, this type of nucleotide exchange has the most numerous
overlap coding genes according to analyses in Table 3.
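
For readers who wish to experiment with this kind of comparison, a minimal sketch of a one tailed sign test follows. It computes the exact upper tail of a binomial distribution with success probability 0.5; the function name is hypothetical. P values obtained this way need not coincide with those reported in the text, which may follow a different convention.

```python
from math import comb

def sign_test_upper_tail(positives, n):
    """Exact P(X >= positives) for X ~ Binomial(n, 0.5).

    `positives` is the number of comparisons in the predicted direction
    and `n` the total number of comparisons (ties assumed to be excluded).
    """
    if not 0 <= positives <= n:
        raise ValueError("positives must lie between 0 and n")
    tail_count = sum(comb(n, k) for k in range(positives, n + 1))
    return tail_count / 2 ** n

# Hypothetical example: 9 comparisons in the predicted direction among 12.
p_example = sign_test_upper_tail(9, 12)
```
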
One can test the working hypothesis by combining the one tailed P
values obtained from sign tests for all nine types of nucleotide
exchanges. Fisher’s method for combining P values sums −2 × ln Pi,
where Pi is the P value obtained for the ith test, and i ranges from 1
to k. Under the null hypothesis, this sum follows a chi-square
distribution with 2 × k degrees of freedom. In the present case the sum
is 43.82, which with 18 degrees of freedom has P = 0.00061. Hence the
null hypothesis for the combined data is rejected: predicted overlap
coding genes tend to avoid circular code codons and prefer homogeneous
codons, as compared to regular coding regions, when considering all
types of nucleotide exchanges together. This confirms their coding
status according to the circular code approach.
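
Fisher's method as described above can be sketched as follows. The sketch relies on the closed form of the chi-square upper tail for an even number of degrees of freedom, so it needs no external library; function names are hypothetical, and the individual P values for the nine exchange types are not listed in the text and are therefore not reproduced here.

```python
from math import exp, log

def chi2_upper_tail_even_df(x, df):
    """Upper tail probability of a chi-square variable with even `df`:
    exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!"""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0   # j = 0 term of the series
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return exp(-half) * total

def fisher_combined(p_values):
    """Return (statistic, degrees of freedom, combined P) for k P values."""
    if not p_values or any(not 0 < p <= 1 for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    statistic = -2.0 * sum(log(p) for p in p_values)
    df = 2 * len(p_values)
    return statistic, df, chi2_upper_tail_even_df(statistic, df)

# Tail probability for the statistic and degrees of freedom given in the text.
p_combined = chi2_upper_tail_even_df(43.82, 18)
```

Evaluated at 43.82 with 18 degrees of freedom, the tail probability is approximately 0.0006, in line with the value quoted in the text.
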
3.3. Convergence between functionality predictions of overlap
coding genes by deamination gradient and circular code tests
Examination of Figs. 5 and 6 shows that nucleotide ratios at third
codon positions fit replicational deamination gradients better for some
predicted overlap coding genes than for others. The extent to which a
datapoint deviates from the deamination gradient might be proportional
to gene functionality. In this respect, predicted overlap coding
regions match the gradient approximately as well as other regions for
A↔C exchange transcription, while deviations were much greater for G↔U
exchange transcription, which seems to match the greater abundance of
G↔U RNA transcripts in Table 2. By extension, this rationale could
apply to different overlapping genes from the same type of nucleotide
exchange. Possibly, those with greater absolute deviation from the
gradient are more functional than those matching the gradient more
closely. Hence functionality might be proportional to the absolute
value of the residual of the A/(A+G) ratio at third codon position for
a putative overlap coding gene from the deamination gradient observed
for regular regions.
Fig. 7. Circular code overlapping gene score versus absolute residual of third codon position A/(A+G) from deamination gradient. The y axis is the subtraction of the circular code score calculated for gene regions coding only in the regular main frame, from the score obtained for regions predicted involved in overlap coding according to G↔U exchange transcription (from Table 3). Presumably, the lower this score is, the greater the functionality of the predicted overlap coding gene. The x axis is the absolute residual of the A/(A+G) base ratio for the same overlap coding regions for G↔U exchange transcription from the replicational deamination gradient presented in Fig. 6. Functionality of overlap coding genes is assumed proportional to this absolute residual. Hence the negative association in Fig. 7 suggests that functionality estimates for the same putative overlapping genes, but from different methods, tend to converge.
A similar rationale can be developed for the difference obtained by
subtracting the mean ‘circular code’ score of regions of a gene
involved only in regular coding from the ‘circular code’ score obtained
for the predicted overlap coding regions in that gene. One might assume
that overlap coding functionality decreases the more positive the value
obtained from that subtraction.
According to these functionality rationales, absolute residuals (from
deamination gradients) and circular code score subtractions should be
negatively correlated, because according to that interpretation, they
would estimate the same phenomenon. It is worth recalling in this
context that the two tests are independent of each other in terms of
theoretical background, and analyze different properties of the
sequences. Hence a positive result (meaning a negative correlation in
this context) is not trivial. Fig. 7 plots the circular code score for
predicted overlap coding regions for G↔U exchange transcription
according to Table 3, as a function of the absolute value of residuals
for A/(A+G) ratios at third codon positions for these putative overlap
coding regions from the deamination gradient analysis in Fig. 6. The
presumed functionality estimates from these independent tests are
indeed negatively correlated (r = −0.6272, one tailed P = 0.0082; but
note that rs = −0.36, one tailed P = 0.095), as one would expect if
these estimates reflect functionality of the different predicted
overlap coding genes. Hence gene-wise results for the two tests of
overlapping gene functionality might confirm each other. The fact that
the more robust but less sensitive nonparametric Spearman rank
correlation analysis, rs, does not confirm the result of the parametric
analysis does not invalidate the principle, but at this point does not
allow high confidence in the result.
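
The contrast between the parametric coefficient r and the rank coefficient rs can be illustrated with a short sketch. The data below are invented for illustration and are unrelated to Fig. 7; function names are hypothetical. Spearman's rs is computed here as the Pearson correlation of average ranks.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient r."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation rs (Pearson correlation of the ranks)."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented data: one extreme point dominates r but not rs.
x_demo = [1.0, 2.0, 3.0, 4.0, 20.0]
y_demo = [5.0, 3.0, 6.0, 4.0, -10.0]
r_demo, rs_demo = pearson(x_demo, y_demo), spearman(x_demo, y_demo)
```

In the invented example a single extreme point drives r much further from zero than rs, which is one way the two coefficients can disagree on the same data.
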
Analyses similar to those in Fig. 7 were done for each of the eight
remaining types of nucleotide exchanges. Across the nine types, the
correlation was negative (as expected) in 5. It was statistically
significant according to a one tailed test for Pearson correlation
coefficients for putative overlap coding genes due to C↔U exchange
transcription (r = −0.538, P = 0.044) and those due to A↔U+C↔G exchange
transcription (r = −0.675, P = 0.0029). Hence convergence between
functionality estimates from replicational mutation gradient analyses
and circular code score analyses was statistically significant at
P < 0.05 for three among the nine types of potential transcriptional
nucleotide exchanges. According to Table 2, all three are types of
nucleotide exchanges that are relatively frequently encountered at the
level of RNA transcripts. The fact that two additional analyses yield
qualitatively the same statistically significant result does increase
the confidence level in the result, despite inconclusive confirmation
of the trend in Fig. 7 by the nonparametric rs analysis. This is
because the replicability of a result, by independent tests, is the
best insurance against false rejection of the null hypothesis.
3.4. A meta-analysis of exchange nucleotide transcripts and coding
The analysis in Fig. 7 indicates that deamination gradient and circular
code analyses converge for putative overlap coding genes according to
G↔U nucleotide exchanges. Similar levels of convergence were found for
two other nucleotide exchanges (C↔U, and A↔U+C↔G). All three are among
the nucleotide exchanges with the most abundant transcript data in
Table 2. It is possible that the level of convergence between
deamination gradient and circular code analyses, as estimated by an r²
such as the one in Fig. 7, is associated with RNA transcription. Fig. 8
plots this r² while keeping the sign of the correlation coefficient
(the more negative, the more convergence; the value 1 is added to avoid
negative numbers), as a function of the number of genome regions for
which RNA transcripts were found (Table 2). Because lower values
indicate more convergence, a negative trend is expected. It is detected
by Pearson’s parametric correlation coefficient (r = −0.611, one tailed
P = 0.04), but cannot be statistically confirmed at P < 0.05 by
Spearman’s nonparametric rank correlation (rs = −0.5, one tailed
P = 0.078).
Nevertheless, the fact that nucleotide exchange types for which a high
level of convergence between tests for overlap coding exists are also
those for which RNA transcription is relatively frequent is not at all
trivial. It is not simply a confirmation of cryptic coding after
nucleotide exchange, and of nucleotide exchange transcription. It shows
that the independent evidence for each of these phenomena tends to be
coherently integrated. Fig. 8 thus integrates all the evidence
presented here, and shows consistency between all the types of
analyses. Hence despite the speculative impression given by the working
hypothesis due to its presumed revolutionary meaning in relation to
accepted principles of molecular biology, the data at hand are a strong
confirmation that the working hypothesis is a valid approach for
understanding coding properties of DNA, and the way these are
expressed. This implies that the number of protein coding genes in the
presumably well known vertebrate mitochondrial genome is approximately
one order of magnitude greater than believed until now.
3.5. Human DNA gamma polymerase misinsertion polymerization rates and
systematic symmetric nucleotide exchange polymerization
There is a further important piece of evidence confirming the working
hypothesis, in relation to the existence of nucleotide exchange
polymerization. Unlike the analyses of putative overlapping genes, this
evidence is solely based on direct empirical experimental observation,
and is therefore a very strong argument favoring the working
hypothesis. It is plausible that systematic nucleotide exchanges during
RNA polymerization follow in principle physico-chemical and enzymatic
processes very similar to those of the occasional nucleotide
misinsertions (corresponding to the same replacing and replaced
nucleotides as in exchange transcription) known for the human
mitochondrial DNA gamma polymerase (Lee and Johnson, 2006) and some
other polymerases (e.g., Bertram et al., 2010; Zamft et al., 2012).
Hence this approach assumes that properties of misinsertions, such as
their rate parameters, should be proportional to the abundance of RNAs
produced by systematic nucleotide exchanges corresponding to the
nucleotides replaced and replacing in that DNA misinsertion. In short,
systematic nucleotide exchanges should follow kinetic principles
observed for occasional (erroneous) nucleotide exchanges
(misinsertions).
Fig. 8. Convergence between deamination gradient and circular code analyses as a function of the number of genome regions that are exchange transcribed according to Table 2. The y axis is the Pearson correlation coefficient r (+1) between the absolute value of residual A/(A+G) at third codon positions for regions predicted to function as overlapping genes according to a given nucleotide exchange rule (according to Table 3) and the circular code score for that putative overlapping gene, for each of the nine types of nucleotide exchanges. The lower the value according to the y axis, the greater the convergence between deamination and circular code analyses in confirming overlap coding for that type of nucleotide exchange. Fig. 7 shows the data used to calculate that Pearson correlation coefficient for G↔U exchange transcription; analyses similar to those in Fig. 7 were done for each of the nine nucleotide exchange rules, and Pearson correlation coefficients from these analyses are used in the y axis of Fig. 8. The x axis is the number of genome regions for which RNA was detected according to that specific nucleotide exchange rule. The trend in Fig. 8 shows that nucleotide exchange types according to which RNA has been detected for numerous regions are also those for which analyses such as those in Fig. 7 indicate a high degree of convergence between deamination and circular code analyses. This shows that convergence between two types of independent bioinformatics analyses converges with detected frequencies of RNA transcripts, an ‘experimental’ confirmation (x axis) of complex computational results (y axis).
Transcription is a DNA→RNA directed process, but no data on the
fidelity of the mitochondrial RNA polymerase are available. Because DNA
and RNA are quite similar, misinsertion rates by the mitochondrial DNA
polymerase gamma were used for these analyses. This is also adequate
because one cannot exclude that this enzyme is responsible for
exchanging RNA polymerization, perhaps in combination with specific
conditions and/or other proteins. Which type of systematic nucleotide
exchanging RNA polymerization occurs could be determined by such
interactions with the polymerase(s) responsible for nucleotide
exchanging RNA polymerization. According to a simplistic
Michaelis–Menten approach to enzymatic reaction kinetics, reactions are
parametrized according to the affinity and the maximal reaction rate.
The first reflects initial reaction rates when the substrate is rare
(the medium has few free nucleotides for insertion, the enzyme is
relatively frequent as compared to its substrate), the second reflects
rates when the enzyme is saturated (the medium is rich in free
nucleotides to be (mis)inserted). These parameters are indicated as kd
and kpol, respectively, in Table 2 from Lee and Johnson (2006).
Fig. 9. Mean kd for nucleotide misinsertions by the human mitochondrial DNA polymerase gamma as a function of numbers of RNA transcripts with systematic symmetric RNA nucleotide exchanges corresponding to nucleotide misinsertions in the DNA. Transcript abundances are from Table 2, mean kds (affinities) are from Table 2 in Lee and Johnson (2006). This shows that RNAs produced by nucleotide exchange transcription are predicted by misinsertion kinetics of exchanged nucleotides.
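
The two regimes of the Michaelis–Menten description given in the text can be made concrete with a small sketch. It assumes the usual hyperbolic form, rate = kpol × [N] / (kd + [N]), with [N] the free nucleotide concentration; this mapping of kd and kpol onto the formula is an assumption of the sketch, and the parameter values are hypothetical rather than taken from Lee and Johnson (2006).

```python
def insertion_rate(conc, kd, kpol):
    """Hyperbolic (Michaelis-Menten type) rate: kpol * conc / (kd + conc)."""
    if conc < 0 or kd <= 0 or kpol < 0:
        raise ValueError("conc and kpol must be >= 0 and kd must be > 0")
    return kpol * conc / (kd + conc)

# Hypothetical parameters, in arbitrary units.
KD_DEMO, KPOL_DEMO = 10.0, 5.0

# Scarce nucleotides: the rate is close to (kpol / kd) * conc,
# so it is governed by the ratio of the two parameters.
rate_scarce = insertion_rate(1e-6, KD_DEMO, KPOL_DEMO)

# Abundant nucleotides: the rate approaches kpol, whatever kd is.
rate_saturated = insertion_rate(1e9, KD_DEMO, KPOL_DEMO)

# At conc == kd the rate is exactly half of kpol.
rate_half = insertion_rate(KD_DEMO, KD_DEMO, KPOL_DEMO)
```

Under this form, kd matters mostly when nucleotides are scarce, while kpol sets the ceiling reached when they are abundant.
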
For each type of systematic symmetric nucleotide exchange, I averaged
the corresponding kds and, separately, the kpols from Lee and Johnson
(2006). The mean polymerization kd (calculated for each type of
nucleotide exchange) is negatively correlated with the mean length of
RNA transcripts detected for the corresponding nucleotide exchanges
(r = −0.689, P = 0.02; rs = −0.6, P = 0.045, one tailed tests, Fig. 9).
The association with kpol is statistically weaker, and positive, which
is not surprising because kd and kpol are inversely proportional, a
well known phenomenon in kinetics: a high enzymatic specificity for its
substrate (high affinity, kd) comes at the expense of its maximal rate.
Hence results suggest that frequent types of nucleotide exchanges
correspond to nucleotide misinsertions with high kpol (low kd). The
observation that correlations are statistically strongest with kd
suggests that symmetric systematic nucleotide exchanges are limited by
conditions where nucleotides are relatively rare. Putatively, these
systematic symmetric nucleotide exchanges occur when nucleotides are
relatively scarce, which would explain a stronger effect of kd than of
kpol on their elongation.
This result is remarkable because it means that some physico-chemical
and/or enzymatic principles inherent to nucleotide misinsertions
coherently explain the data in Table 2. This excludes the possibility
that artifacts created the RNAs in Table 2. The phenomena described
here are shown to be meaningful on both chemical and biological
grounds.
The mean kd also predicts levels of expression of predicted overlapping
genes, as estimated by the difference between the strength of the
replicational deamination gradients observed at (main frame) third
codon positions for regions that are predicted to be involved in
overlap coding (after systematic nucleotide exchange) versus third
codon positions in other regions of the same genes (see analyses in
Section 2.2.4 and corresponding Figs. 5 and 6). Along that approach,
the stronger the gradient for non-overlap coding regions as compared to
predicted overlap coding regions, the weaker the expression of the
predicted overlapping genes encoded by that type of symmetric
systematic nucleotide exchange.
Fig. 10. Difference between strengths of replicational deamination gradient in regions not involved in overlap coding and in those involved in overlap coding as a function of mean kd for nucleotide misinsertions by the human mitochondrial DNA polymerase gamma corresponding to the nucleotide exchanges observed in the RNA. Open circles are for A→G, closed circles for C→T deamination gradients (light strand annotation; not to be confused with nucleotide exchanges during RNA transcription: A→G and C→T in this case represent spontaneous mutations by deaminations during DNA replication). The y axis is calculated, for each type of nucleotide exchange, from an analysis such as that presented in Figs. 5 and 6. The x axis is identical to that in Fig. 9. The result shows, as for Fig. 8, that computational results from bioinformatics analyses converge with misinsertion kinetics of exchanged nucleotides.
Fig. 10 plots this difference as a function of the mean kd. The
difference is calculated after a z transformation of the Pearson
correlation coefficients (Amzallag, 2001) that estimate the strengths
of the replicational deamination gradients; the z transformation
accounts for sample size effects (Seligmann et al., 2007). The gradient
analyses were done separately for two types of transitions predicted to
follow the replicational gradient, A→G and C→T (open and filled circles
in Fig. 10, respectively). Note that in this case A→G and C→T are
mutations due to deaminations that occur during DNA replication, not
nucleotide exchanges occurring during RNA transcription. For both the
A→G and C→T gradients, the replicational gradient is stronger for
regions not expected to be involved in overlap coding than for those
expected to be involved in overlap coding in a majority of types of
symmetric nucleotide exchanges (values above ‘1’ on the y axis in
Fig. 10), and this difference increases with mean kd (A→G gradient,
r = 0.622, P = 0.037, rs = 0.533, P = 0.0655; C→T gradient, r = 0.722,
P = 0.014, rs = 0.717, P = 0.021, one tailed tests). Hence types of
nucleotide exchange polymerizations that are expected to have high
rates of polymerization at low nucleotide concentrations seem to be
most expressed, and therefore predicted overlapping protein coding
genes are proportionally more conserved as compared to the
replicational deamination gradient observed in other regions of the
genes.
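
A z transformation of correlation coefficients of the kind mentioned above can be sketched as follows. The text does not give the exact computation used for Fig. 10, so the standardized difference below is only one common way of comparing two correlations while accounting for sample size; the function names and the example values are hypothetical.

```python
from math import atanh, sqrt

def fisher_z(r):
    """Fisher z transformation of a correlation coefficient: atanh(r)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return atanh(r)

def standardized_z_difference(r1, n1, r2, n2):
    """Difference between two z transformed correlations, divided by its
    approximate standard error sqrt(1/(n1 - 3) + 1/(n2 - 3)).
    This particular form is an assumption of the sketch."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 observations")
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error

# Hypothetical example: gradient strength r = 0.8 over 13 regular regions
# versus r = 0.3 over 9 predicted overlap coding regions.
z_demo = standardized_z_difference(0.8, 13, 0.3, 9)
```
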
The same principle is observed when circular code analyses (Section 2.3
and y axis of Fig. 7) are used to estimate expression of predicted
overlapping protein coding genes. The relative usage of homopolymers as
opposed to circular code codons is expected to be greater in expressed
overlap coding regions than in other regions, and this difference is
expected to increase with expression levels. Fig. 11 shows that high
mean kpols correspond to types of systematic nucleotide exchanges where
this difference is large, and vice versa (r = 0.708, P = 0.016;
rs = 0.70, P = 0.024, one tailed tests). Hence here, bioinformatics
analyses estimating high expression levels correspond to types of
nucleotide exchange polymerizations that are expected to have high
rates of polymerization at high nucleotide concentrations.
Fig. 11. Difference between proportions of homopolymers among circular code codons in predicted overlap coding regions and in other regions for different types of systematic symmetric nucleotide exchanges as a function of the mean kpol of corresponding nucleotide misinsertions (from Table 2 in Lee and Johnson, 2006). Overlapping genes expected to be more expressed according to the y axis correspond to types of nucleotide misinsertions with high DNA polymerization rates (x axis). The result shows, as for Figs. 8 and 9, that computational results from bioinformatics analyses converge with the known DNA misinsertion kinetics of the RNA exchanged nucleotides.
These results suggest that deamination gradient analyses better
estimate expression of predicted overlapping genes encrypted by
systematic symmetric nucleotide exchanges at low nucleotide
concentrations. Hence these might be associated with stressful
conditions such as low resource availability and low metabolism, as
suggested for other types of alternative mitochondrial gene expression
(Seligmann, 2010c, 2011a), which putatively would favor deaminations.
Figs. 7 and 8 show that deamination gradient and circular code analyses
tend to converge in their overall patterns, yet it seems that each
approach better fits specific conditions. Circular code analyses seem
to better estimate expression of overlapping genes encrypted by
nucleotide exchanges at high concentrations of free nucleotides.
4. General discussion
The analyses presented above confirm the hypothesis that transcription
that systematically exchanges nucleotides (in a symmetric manner)
reveals protein coding genes that were not detected until now in the
human mitochondrial genome. A number of lines of evidence suggest this:
(1) RNA transcripts fitting polymerization according to several
nucleotide exchange rules are detected in GenBank’s EST database
(Table 2); (2) Blastp analyses of putative polypeptides translated from
‘exchange transcribed’ sequences yield numerous alignments with
proteins existing in GenBank; (3) identities of aligning proteins seem
non-random in relation to mitochondrial metabolism and include numerous
proteins interacting with DNA and RNA (putatively, future studies will
find that some of the proteins responsible for exchange transcription
are among these); (4) these putative overlapping protein coding genes
include few stop codons; (5) bias against stop codons within putative
overlapping protein coding genes is inversely proportional to
transcript abundances of nucleotide exchange types, suggesting a
balance between positive and negative regulations of expression of
overlapping genes coded by nucleotide exchange transcription
(upregulation) and stop codon presence (downregulation); (6)
replicational deamination gradient analyses tend to confirm the coding
status of putative overlapping protein coding genes; (7) circular code
analyses of codon usages in putative overlap coding regions also
confirm this status; (8) results of 6 and 7 tend to converge; (9) that
level of convergence is consistent with the number of genome regions
that are found ‘exchange transcribed’; (10) frequencies and lengths of
RNA transcripts corresponding to different types of nucleotide
exchanges are explained by kinetic parameters of occasional nucleotide
misinsertions by the human mitochondrial DNA polymerase gamma that
reflect the assumed transcriptional nucleotide exchanges. It is
particularly notable that results for each of the 10 levels are
independent, yet yield a highly integrated overall picture.
This confirms that the coding system is much more complex than usually
believed (Mercer et al., 2011a,b), and that some types of
coding/recoding events, though apparently rare or very rare, actually
exist. At this point, the next major steps are similar analyses for
nucleotide exchanges that are not symmetric, and investigating whether
the proteins predicted by the analyses can be found and extracted from
mitochondria. It is important to note that analyses suggest that some
of the overlap coding genes seem more optimized than others. This could
have two meanings: their expression level is greater, and/or their
function is more important. It is not certain that nucleotide exchange
types that seem more frequent are necessarily those that are most
important from a functional point of view. Hence transcript abundance
does not need to be perfectly correlated with optimization. It seems
plausible that the importance of a coding system associated with a
given type of nucleotide exchange is reflected not only by the
abundance of transcripts detected (Table 2), but also by the number of
putative protein coding genes detected (Table 3) and by the extent to
which overlap coding is independent of transcription.
The analyses compare different types of nucleotide exchanges. It is
possible that these are not all variations of the same phenomenon. Six
nucleotide exchange rules involve only one pair of nucleotides and
three involve two pairs. In addition, some of these pairs exchange
nucleotides of the same type (purine with purine, or pyrimidine with
pyrimidine), while others do not. This might imply mechanisms of
different natures. Furthermore, the nucleotide exchange A↔U+C↔G could
be compatible with a different type of polymerization, which does not
necessarily imply nucleotide exchange, but would result in the same
transcript sequence. It might result from regular 5′-to-3′ RNA
polymerization where the progression follows the 3′-to-5′ direction of
a sequence, a phenomenon that has not yet been observed, but for which
evidence exists (Seligmann, 2012d); note that 3′-to-5′ directed RNA
polymerization does occur (Jackman et al., 2012), including in
mitochondria. Such RNA is also compatible with RNA that forms DNA-RNA
triplexes according to antiparallel Hoogsteen base pairings, which have
been observed in vertebrate mitochondria (Annex and Williams, 1990;
Rocher et al., 2002; Takamatsu et al., 2002).
There are other notable observations pertaining to overlap coding
through systematic nucleotide exchanges. Most proteins aligning with
sequences translated from such exchange transcribed human mitochondrial
sequences have eukaryotic, mainly metazoan origins. The alignment data
suggest that occasionally, the nucleotide exchange-coded genes are
recoded and integrated in the genome so that the protein is coded
without nucleotide exchange. This seems to occur relatively rarely, as
in most cases, only one protein from one organism aligns with the
protein translated from nucleotide exchange transcripts. The data
indicate some phyletic clustering for mitochondrially
nucleotide-exchange encoded overlap coding genes and those that are
encoded without nucleotide exchange: organisms possessing mitochondria
are overrepresented among the latter.
The analyses clearly exclude the possibility that transcripts
detected as exchanging nucleotides are due to some kind of annotation
error or statistical artifact (see Fig. 1), especially because exchange
transcription rates are predicted by rate parameters of misinsertion
kinetics for corresponding nucleotides: systematic nucleotide
exchange rates during RNA transcription are proportional to occasional
replicational mutation rates due to DNA misinsertion of the
same nucleotide types (Figs. 9–11). Hence nucleotide exchanging
transcription fits basic biochemical nucleotide properties that also
affect their DNA misinsertion rate kinetics. However, it remains
possible that these ESTs are the product of dysfunctional polymerases
during the process creating the cDNA libraries.
In that case, the data in Table 2 would not directly reflect frequencies
of naturally occurring nucleotide exchanging transcription in
the mitochondrion. These would only be indirectly estimated, from
the production of cDNAs by RNA→DNA reverse transcription. Both
possibilities are plausible, and are not mutually exclusive. However,
even if the ‘unnatural’ scenario for exchange transcript production
were correct, the transcript abundances produced by that ‘unnatural’
mechanism are proportional to computational predictions of overlap
protein coding genes embedded in nucleotide exchange RNA
transcripts (Figs. 2, 4, 8, 10 and 11). This coherence between gene
contents and transcript abundance indicates that abundances from
Table 2 reflect a natural reality of mitochondria (and cells), even
if RNA→DNA reverse transcription, and not DNA→RNA transcription,
produced the suspected transcripts. In that case, occasional
RNA→DNA reverse transcriptase dysfunctions would have given
insights into the existence of a previously unknown family of related
types of polymerization.
Nucleotide exchange coding, as a way to encode more genes
without increasing genome length, seems particularly adequate for
the dense vertebrate mitochondrial genome. However, there is no
a priori ground to assume that such coding is limited to the
mitochondrial genome. It is very probable that at various levels, this
type of coding occurs also in the nucleus, and in prokaryotes. Hence
protein coding genes encoded by genomes might be much more
numerous than believed.
References
Ahmed, A., Frey, G., Michel, C.J., 2007. Frameshift signals in genes associated with
the circular code. In Silico Biol. 7, 155–168.
Ahmed, A., Frey, G., Michel, C.J., 2010. Essential molecular functions associated with
circular code evolution. J. Theor. Biol. 264, 613–622.
Ahmed, A., Michel, C.J., 2011. Circular code signal in frameshift genes. J. Comp. Sci.
Syst. Biol. 4, 7–15.
Akashi, H., Gojobori, T., 2002. Metabolic efficiency and amino acid composition in
the proteomes of Escherichia coli and Bacillus subtilis. Proc. Natl. Acad. Sci. U.S.A.
99, 3695–3700.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman,
D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs. Nucl. Acids Res. 25, 3389–3402.
Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schäffer, A.A., Yu,
Y.K., 2005. Protein database searches using compositionally adjusted substitution
matrices. FEBS J. 272, 5101–5109.
Alves, R., Savageau, M.A., 2005. Evidence of selection for low cognate amino acid
bias in amino acid biosynthetic enzymes. Mol. Microbiol. 56, 1017–1034.
Amzallag, G.N., 2001. Data analysis in plant physiology: are we missing the reality?
Plant Cell Environ. 24, 881–890.
Annex, B.H., Williams, R.S., 1990. Mitochondrial DNA structure and expression
in specialized subtypes of mammalian striated muscle. Mol. Cell. Biol. 10,
5671–5678.
Arqués, D.G., Michel, C.J., 1996. A complementary circular code in the protein coding
genes. J. Theor. Biol. 182, 45–58.
Arqués, D.G., Michel, C.J., 1997. A circular code in the protein coding genes of
mitochondria. J. Theor. Biol. 189, 273–290.
Barton, M.D., Delneri, D., Oliver, S.G., Rattray, M., Bergman, C.M., 2010. Evolutionary
systems biology of amino acid biosynthetic cost in yeast. PLoS One 5,
e11935.
Bertram, J.G., Oertell, K., Petruska, J., Goodman, M.F., 2010. DNA polymerase fidelity:
comparing direct competition of right and wrong dNTP substrates with steady
state and pre-steady state kinetics. Biochemistry 49, 20–28.
Brocchieri, L., Karlin, S., 2005. Protein length in eukaryotic and prokaryotic proteomes.
Nucl. Acids Res. 33, 3390–3400.
Chipman, A.D., Khaner, O., Haas, A., Tchernov, E., 2001. The evolution of genome
size: what can be learned from anuran development? J. Exp. Zool. A 291,
364–374.
Daniel, C., Wahlstedt, H., Ohlson, J., Bjork, P., Ohman, M., 2011. Adenosine-to-inosine
RNA editing affects trafficking of the γ-aminobutyric acid type A (GABAA) receptor.
J. Biol. Chem. 286, 2031–2040.
Dias Neto, E., Garcia Correa, R., Verjovski-Almeida, S., Briones, M.R., Nagai, M.A.,
da Silva Jr., W., Zago, M.A., Bordin, S., Costa, F.F., Goldman, G.H., Carvalho, A.F.,
Matsukuma, A., Baia, G.S., Simpson, D.H., Brunstein, A., de Oliveira, P.S., Bucher,
P., Jongeneel, C.V., O’Hare, M.J., Soares, F., Brentani, R.R., Reis, L.F., de Souza, S.J.,
Simpson, A.J., 2000. Shotgun sequencing of the human transcriptome with ORF
expressed sequence tags. Proc. Natl. Acad. Sci. U.S.A. 97, 3491–3496.
Faure, E., Delaye, L., Tribolo, S., Levasseur, A., Seligmann, H., Barthélémy, R.-M., 2011.
Probable presence of an ubiquitous cryptic mitochondrial gene on the antisense
strand of the cytochrome oxidase I gene. Biol. Direct 6, 56.
Frederico, L.A., Kunkel, T.A., Shaw, B.R., 1990. A sensitive genetic assay for the detection
of cytosine deamination: determination of rate constants and the activation
energy. Biochemistry 29, 2532–2537.
Gonzalez, D.L., Giannerini, S., Rosa, R., 2011. Circular codes revisited: a statistical
approach. J. Theor. Biol. 275, 21–28.
Gregory, T.R., Hebert, P.D.N., 1999. The modulation of DNA content: proximate
causes and ultimate consequences. Genome Res. 9, 317–324.
Huang, G.M., Ng, W., Farkas, l., He, J., Liang, L., Gordon, H.A., Yu, D., Hood, J.L., 1999.
Prostate cancer expression profiling by cDNA sequencing analysis. Genomics 59,
178–186.
Itzkovitz, S., Alon, U., 2007. The genetic code is nearly optimal for allowing additional
information within protein-coding sequences. Genome Res. 17, 405–412.
Jackman, J.E., Gott, J.M., Gray, M.W., 2012. Doing it in the reverse: 3′-to-5′ polymerization
by the Thg1 superfamily. RNA 18, 886–899.
Jin, Y., Tian, N., Cao, J., Liang, J., Yang, Z., Lv, J., 2007. RNA editing and alternative splicing
of the insect nAChR subunit alpha6 transcript: evolutionary conservation,
divergence and regulation. BMC Evol. Biol. 7, 98.
Kasiviswanathan, R., Copeland, W.C., 2011. Ribonucleotide discrimination and
reverse transcription by the human mitochondrial DNA polymerase. J. Biol.
Chem. 286, 31490–31500.
Krishnan, N.M., Seligmann, H., Raina, S.Z., Pollock, D.D., 2004a. Detecting gradients
of asymmetry in site-specific substitutions in mitochondrial genomes. DNA Cell
Biol. 23, 707–714.
Krishnan, N.M., Seligmann, H., Raina, S.Z., Pollock, D.D., 2004b. Phylogenetic analyses
detect site-specific perturbations in asymmetric mutation gradients. Curr.
Comput. Mol. Biol. 2004, 266–267.
Krizek, M., Krizek, P., 2012. Why has nature invented three stop codons of DNA and
only one start codon? J. Theor. Biol. 304, 183–187.
Lee, H.R., Johnson, K.A., 2006. Fidelity of the human mitochondrial DNA polymerase.
J. Biol. Chem. 281, 36236–36240.
Lev-Maor, G., Sorek, R., Levanon, E.Y., Paz, N., Eisenberg, E., Ast, G., 2007. RNA-editing-mediated
exon evolution. Genome Biol. 8, R29.
Liew, C.C., Hwang, D.M., Fung, Y.W., Laurenssen, C., Cukerman, E., Tsui, S., Lee,
C.Y., 1994. A catalogue of genes in the cardiovascular system as identified by
expressed sequence tags. Proc. Natl. Acad. Sci. U.S.A. 91, 10645–10649.
Lui, V.W.Y., Luk, S.C.W., Tsui, S.K.W., Tung, C.K.C., Yam, N.Y.H., Liew, C.C., Lee, C.Y.,
1995. Gene expression of adult human heart as revealed by random sequencing
of cDNA library. In: Miami Winter BioTechnol. Symp. Proc., vol. 6, p. 90.
Mercer, T.R., Dinger, M.E., Crawford, J., Smith, M.A., Shearwood, A.M., Haugen, E.,
Bracken, C.P., Rackham, O., Stamatoyannopoulos, J.A., Filipovska, A., Mattick, J.S.,
2011a. The human mitochondrial transcriptome. Cell 146, 645–658.
Mercer, T.R., Gerhardt, D.J., Dinger, M.E., Crawford, J., Trapnell, C., Jeddeloh, J.A.,
Mattick, J.S., Rinn, J.L., 2011b. Targeted RNA sequencing reveals the deep complexity
of the human transcriptome. Nat. Biotechnol. 30, 99–104.
Michel, C.J., 2012. Circular code motifs in transfer RNA and 16S ribosomal RNAs: a
possible translation code in genes. Comput. Biol. Chem. 34, 24–37.
Namy, O., Lecointe, F., Grosjean, H., Rousset, J.-H., 2005. Translational recoding and
RNA modifications. Fine-tuning of RNA functions by modification and editing.
Top. Curr. Genet. 12, 2005–2340.
Paz, N., Levanon, E.Y., Amariglio, N., Heimberger, A.B., Ram, Z., Constantini, S.,
Barbashi, Z.S., Adamsky, K., Safran, M., Hirschberg, A., Krupsky, M., Ben-Dov,
I., Cazacu, S., Mikkelsen, T., Brodie, C., Eisenberg, E., Rechavi, G., 2007.
Altered adenosine-to-inosine RNA editing in human cancer. Genome Res. 17,
1586–1595.
Perlstein, E.O., de Bivort, B.L., Schreiber, S.L., 2007. Evolutionarily conserved optimization
of amino acid biosynthesis. J. Mol. Evol. 98, 186–196.
Raina, S.Z., Faith, J.J., Disotell, T.R., Seligmann, H., Stewart, C.B., Pollock, D.D., 2005.
Evolution of base-substitution gradients in primate mitochondrial genomes.
Genome Res. 15, 665–673.
Reenan, R.A., 2005. Molecular determinants and guided evolution of species-specific
RNA editing. Nature 434, 409–413.
Rocher, C., Letellier, T., Copeland, W.C., Lestienne, P., 2002. Base composition
at mtDNA boundaries suggests a DNA triple helix model for human
mitochondrial DNA large-scale rearrangements. Mol. Genet. Metab. 76,
123–132.
Seligmann, H., 2007. Cost minimization of ribosomal frameshifts. J. Theor. Biol. 249,
162–167.
Seligmann, H., 2008. Hybridization between mitochondrial heavy strand tDNA and
expressed light strand tRNA modulates the function of heavy strand tDNA as
light strand replication origin. J. Mol. Biol. 379, 188–199.
Seligmann, H., 2010a. The ambush hypothesis at the whole-organism level: off
frame, ‘hidden’ stops in vertebrate mitochondrial genes increase developmental
stability. Comput. Biol. Chem. 34, 80–85.
Seligmann, H., 2010b. Avoidance of antisense antiterminator tRNA anticodons in
vertebrate mitochondria. Biosystems 101, 42–50.
Seligmann, H., 2010c. Undetected antisense tRNAs in mitochondrial genomes? Biol.
Direct 5, 39.
Seligmann, H., 2011a. Two genetic codes, one genome: frameshifted primate
mitochondrial genes code for additional proteins in presence of antisense
antitermination tRNAs. Biosystems 106, 271–286.
Seligmann, H., 2011b. Mutation patterns due to converging mitochondrial replication
and transcription increase lifespan, and cause growth rate-longevity
tradeoffs. In: Seligmann, H. (Ed.), DNA Replication - Current Advances. InTech,
pp. 151–180 (Chapter 6).
Seligmann, H., 2012a. Coding constraints modulate chemically spontaneous mutational
replication gradients in mitochondrial genomes. Curr. Genomics 13,
37–54.
Seligmann, H., 2012b. Positive and negative cognate amino acid bias affects compositions
of aminoacyl-tRNA synthetases and reflects functional constraints on
protein structure. BIO 2, 11–26.
Seligmann, H., 2012c. An overlapping genetic code for frameshifted overlapping
genes in Drosophila mitochondria: antisense antitermination tRNAs UAR insert
serine. J. Theor. Biol. 296, 61–76.
Seligmann, H. Overlapping genetic codes for overlapping frameshifted genes in
testudines, and Lepidochelys olivacea as a special case. Comput. Biol. Chem., in
press.
Seligmann, H., 2012d. Overlapping genes coded in the 3′-to-5′ direction in mitochondrial
genes and 3′-to-5′ polymerization of non-complementary RNA by an
‘invertase’. J. Theor. Biol. 315, 38–52.
Seligmann, H., 2012e. Putative mitochondrial polypeptides coded by expanded
quadruplet codons, decoded by antisense tRNAs with unusual anticodons.
Biosystems 110, 84–106.
Seligmann, H. Putative protein-encoding genes within mitochondrial rDNA and the
D-loop region. In: Lin, Z., Liu, W. (Eds.), Ribosomes: Molecular Structure, Role in
Biological Functions and Implications for Genetic Diseases, Nova Publishers, in
press.
Seligmann, H., Anderson, S.C., Autumn, K., Bouskila, A., Saf, R., Tuniyev, B.S., Werner,
Y.L., 2007. Analysis of the locomotor activity of a nocturnal desert lizard (Reptilia:
Gekkonidae: Teratoscincus scincus) under varying moonlight. Zoology 110,
104–117.
Seligmann, H., Krishnan, N.M., Rao, B.J., 2006. Possible multiple origins of replication
in primate mitochondria: alternative role of tRNA sequences. J. Theor. Biol. 241,
321–332.
Seligmann, H., Pollock, D.D., 2004a. The ambush hypothesis: hidden stop codons
prevent off-frame gene reading. In: Midsouth Computational Biology and Bioinformatics
Society, vol. 36, Abstract.
Seligmann, H., Pollock, D.D., 2004b. The ambush hypothesis: hidden stop codons
prevent off-frame gene reading. DNA Cell Biol. 23, 701–705.
Sessions, S.K., Larson, A., 1987. Developmental correlates of genome size in plethodontid
salamanders and their implications for genome evolution. Evolution 41,
1239–1251.
Singh, T.R., Pardasani, K.R., 2009. Ambush hypothesis revisited: evidences for phylogenetic
trends. Comput. Biol. Chem. 33, 239–244.
Takamatsu, C., Umeda, S., Ohsato, T., Ohno, T., Abe, Y., Fukuoh, A., Shinagawa, H.,
Hamasaki, N., Kang, D., 2002. Regulation of mitochondrial D-loops by transcription
factor A and single-stranded DNA-binding protein. EMBO Rep. 3, 451–456.
Tanaka, M., Ozawa, T., 1994. Strand asymmetry in human mitochondrial mutations.
Genomics 22, 327–335.
Tse, H., Cai, J.J., Tsoi, H.-W., Lam, E.P.T., Yuen, K.-Y., 2010. Natural selection
retains overrepresented out-of-frame stop codons against frameshift peptides
in prokaryotes. BMC Genomics 11, 491.
Warnecke, T., Hurst, L.D., 2011. Error prevention and mitigation as forces in the
evolution of genes and genomes. Nat. Rev. Genet. 12, 875–881.
Warringer, J., Blomberg, A., 2006. Evolutionary constraints on yeast protein size.
BMC Evol. Biol. 6, 61.
Zamft, B.M., Marblestone, A.H., Kording, K., Schmidt, D., Martin-Alarcon, D., Tyo, K.,
Boyden, E.S., Church, G., 2012. Measuring cation dependent DNA polymerase
fidelity landscapes by deep sequencing. PLoS One 7, e43876.
Zhang, Z., Schwartz, S., Wagner, L., Miller, W., 2000. A greedy algorithm for aligning
DNA sequences. J. Comp. Biol. 7, 203–214.
Zuker, M., 2003. Mfold web server for nucleic acid folding and hybridization prediction.
Nucl. Acids Res. 31, 3406–3415.