A. Fire, R. Alcazar, and F. Tan DNA structure and Germline Activity in C. elegans Genetics-Online April 2006 Supplementary Text

Supplementary Material for Unusual DNA structures associated with germline genetic activity in Caenorhabditis elegans. Genetics (published online April 2006).
Andrew Fire, Rosa Alcazar, and Frederick Tan, Stanford University

Supplementary Methods
  Predictions of DNA bending
  PATC algorithm
  Construction of a minimal redundancy annotated gene prediction set for C. elegans
  Extraction and Aggregation of SAGE Data
  Evaluation of Associations between SAGE data and PATC Score
Supplementary References
Supplementary Figures
  S1. Periodicities of individual tetranucleotides in the C. elegans genome. (Legend S1; Panels S1A, S1B, S1C, S1D)
  S2. Separation profiles between arbitrary tetranucleotides and the periodically-enriched tetranucleotide AAAA/TTTT. (Legend S2; Panels S2A, S2B)
  S3. Asymmetry in periodic distribution of AAAA/TTTT tetranucleotides in the C. elegans genome. (Legend S3; Panels S3A, S3B)
  S4. Long range periodicity measures of the C. elegans genome. (Legend S4; Panels S4A, S4B, S4C, S4D, S4E)
  S5. List of investigator-named C. elegans genes sorted by periodic character. (Legend S5; Figure S5 is included in supplementary material as a separate Excel document)
  S6. List of all C. elegans gene models with calculated periodicity characteristics. (Legend S6; Figure S6 is included in supplementary material as a separate Excel document)
  S7. Annotation-based assessment of association between strong periodicity and germline function for C. elegans genes. (Legend S7; Panel S7)
  S8. Periodicity profiles for 12 germline-expressed genes that have been investigated for possible adaptation to produce extrachromosomal expression vectors. (Legend S8; Panel S8)
  S9. Prediction of unusual "bent" character in the C. elegans genome with a wide variety of computational parameters. (Legend S9; Panel S9)

Two additional archive files are provided separately from this .pdf document:
  PATC_Workbench.pas is Pascal source code implementing the PATC algorithm, provided as a text file which compiles in Metrowerks Pascal (Motorola). The file contains internal notes on console and compiler-based switches.
  AllMotifPeriod.pas is Pascal source code implementing separation plots for each tetranucleotide.

Supplementary Methods

Predictions of DNA bending

Algorithms of Ulanovsky and Trifonov (1987) ['UT87'], Koo et al. (1986) ['KWC86'], and Bolshoy et al. (1991) ['BMHT91'] were adapted to the analysis of genome-size DNA sequences by reducing each algorithm to a series of rapidly calculated sequence-specific adjustments to the path of the predicted helix in 3-dimensional space. For the 'KWC86' algorithm, we have assumed that a four-base segment (AAAA or TTTT) is required to produce bends at each end; under these conditions, this algorithm is much more stringent than the two dinucleotide-based algorithms ('UT87' and 'BMHT91'). Relaxing the stringency of the 'KWC86' algorithm to require only three consecutive A's or T's brings its results much closer to those from the 'UT87' algorithm. Results from the 'UT87' and 'BMHT91' algorithms have similarly been relatively insensitive to the stringency of the assay. Figure S9 shows a series of calculations in which different stringencies and segment lengths show a consistent difference between natural C. elegans sequence and a randomized control.

PATC algorithm

From analysis of exemplary cases and statistical considerations, we developed a scoring system in which the number of AA or TT dinucleotides in a 5-base segment is assigned a score, as are the sequence displacements (9-12 bp) between AA/TT clusters. The algorithm then consists of following a window along the DNA sequence while determining the number and quality of An/Tn clusters that occur on a single face of the predicted duplex. The "wobble" in the AA/TT dinucleotide positioning relative to the helix, as well as in the precise periodicity of the helix, needs to be accommodated as the window is "slid" along the DNA; this is done by examining a window of 4 consecutive dinucleotides and testing different helical periods (9, 10, 11, 12) for each turn.
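As a rough illustration of the scoring system, the following Python sketch encodes the segment scores and step costs that are stated in the detailed description of the PATC algorithm in this section. It is not taken from PATC_Workbench.pas. The function names are hypothetical, and the way scores and costs are combined in `chain_score` (segment scores added, step costs subtracted, starting segment included) is an assumption. The look-ahead search and the 1280 bp sliding window are omitted.

```python
# Illustrative sketch only; names are hypothetical and do not come from PATC_Workbench.pas.

# Score of a 5-base segment, keyed by its number of AA/TT junctions (values from the text).
SEGMENT_SCORE = {4: 30, 3: 20, 2: 10, 1: 0, 0: -5}
# Cost of one backward step, keyed by step length in bases (values from the text).
STEP_COST = {9: 16, 10: 8, 11: 16, 12: 32}


def aa_tt_junctions(segment):
    """Count adjacent AA or TT pairs in an upper-case DNA segment."""
    return sum(1 for i in range(len(segment) - 1) if segment[i:i + 2] in ("AA", "TT"))


def segment_score(seq, center):
    """Score the 5-base segment centered at 0-based index `center`."""
    if center < 2 or center + 3 > len(seq):
        raise ValueError("segment would run off the end of the sequence")
    return SEGMENT_SCORE[aa_tt_junctions(seq[center - 2:center + 3])]


def chain_score(seq, end_center, steps):
    """Net score of one fixed chain of backward steps.

    Assumption: the net score is the sum of segment scores at every visited
    center minus the cost of each step taken.
    """
    center = end_center
    total = segment_score(seq, center)
    for step in steps:
        center -= step  # steps walk backward along the sequence
        total += segment_score(seq, center) - STEP_COST[step]
    return total
```

For example, `chain_score(seq, 12, [10])` scores the segment centered at index 12 plus the segment one canonical helical turn upstream, less the 8-point cost of a 10-base step.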
Defining such a window optimally at every point in the genome has the potential to be a computationally demanding task. We have addressed this using a localized optimization strategy similar to that used by chess-playing programs: at each point, a decision is made by optimizing consequences at a distance of several moves. With the parameters described above, a "look ahead" window of four "moves" appeared sufficient to detect the majority of regions with high An/Tn periodicity.

In detail, the algorithm proceeds as follows. A window of 1280 bp slides along the DNA sequence. Starting from the end of this window, the algorithm defines a series of "steps" of 9-12 base pairs backward along the DNA sequence. Each step identifies the center of a 5-base segment of the DNA that is scored for clustering of AA/TT junctions as follows: segments with four AA/TT junctions (the maximum for a 5-base segment) are scored 30, and segments with three, two, one, and zero AA/TT junctions are scored 20, 10, 0, and -5 respectively. A cost is then assessed based on the length of each step: 10-base steps (the canonical helical pitch for DNA) are assessed at the lowest rate (8 points), while non-canonical steps (9, 11, 12) are assessed at higher rates (16, 16, and 32 points respectively). The algorithm continues to work backward along the DNA, sampling all possible combinations of the next three steps, until a stopping point is reached for which no combination of the next three backward-reaching steps would increase the score. In practice the computational load of the algorithm is limited in that the stopping point is generally reached within a few helical turns for DNA sequence which is not strongly structured. The PATC score assigned to any given base is the maximum value for all combinations of steps that extend through the base in question.

Parameters for this algorithm were set essentially by trial and error, starting with statistical costs based on occurrence of features in randomized DNA sequence (each difference of 20 points was designed to reflect a roughly 10-fold difference in background probability). Slight modifications were then made to minimize overweighting of certain simple features in DNA sequence (e.g., A or T homopolymer tracts). Reassuringly, rather significant deviations in parameters for the program produced only marginal differences in the distribution of apparent periodicity for the C. elegans genome (data not shown). Note, however, that long near-homopolymeric A or T tracts (>40 bp) will generate a high score with this algorithm. The rarity of such tracts in the C. elegans genome ensures that the algorithm truly fulfils the goal of identifying periodic segments. To ensure that this is the case for other scanned genomes, the program maintains a list of the 10 highest scoring genome segments that have been identified in a DNA sample. Manual inspection of this list after running a novel genome is useful to assess the suitability of the algorithm in the corresponding analysis.

Construction of a minimal redundancy annotated gene prediction set for C. elegans.

The analysis in Figures 5 and 6 required us to have a set of individual gene sequences for C. elegans that were annotated in terms of coding regions, introns, exons, and 5' and 3' regions. Despite the remarkable genome resources available for C. elegans, assembly of a defined "unigene" list required a certain degree of filtering and oversight. Following a download of the entire set of C. elegans gene models from wormbase, we first needed to remove a small number of models with clearly incorrect endpoints (these appeared, upon comparison with the individual gene annotations, to be simple typographical errors; although few in number, they represent large regions of DNA and were thus potentially confounding for further analysis). Subsequently we removed all but one model for each individual gene (taking the first alphabetical example in each case). Each gene model was downloaded with 1000 bp of upstream sequence and 1000 bp of downstream sequence as well as the putative transcribed region. Importantly, most genes are only annotated in terms of protein coding sequences and corresponding introns. Thus the database of 5' and 3' transcribed non-translated sequence is quite incomplete, while the database of immediate 5' and 3' flanking non-transcribed sequences will contain substantial components of noncoding intron, exon, and outtron [Blumenthal and Gleason, 2003] sequence. In addition, the choice of a single gene model for each putative coding region (required for the database to be non-redundant) means that some 5' and 3' flanking sequences will actually be coding and intron sequences for differentially spliced isoforms (or correct models) for individual genes.

Extraction and Aggregation of SAGE Data

SAGE data from the UBC genome center were extracted in May 2005, requesting only unambiguous tags assigned to mRNA transcripts. Data were pooled so that the total number of tags from a given gene was summed (file: longsummed.txt; only data from long-SAGE experiments were utilized). A gene
list was then derived representing unique gene names for only those genes that were represented at least once (for any tissue) in the long SAGE data set (file: longsummedlist.txt). Duplicate gene models were eliminated from this set by keeping only a single model (alphabetically the first) for any group of names with the pattern *###*#.#.# (e.g. Y134G5.3.1, Y134G5.3.2, etc.) or *###*#.#* (e.g. Y134G5.1a, Y134G5.1b, etc.) [where *'s are letters and #'s are numbers]. These data were combined with gene-by-gene measurements of PATC score within intron sequences, using the May 2005 version of the C. elegans gene list, from which we filtered out a number of artefactual gene models that were misannotated as covering most of a chromosome.

Evaluation of Associations between SAGE data and PATC Score:

The non-linear (bell-shaped) curve of periodicity versus expression rules out using simple correlation measures to represent the association; rather, we have used a more general measure as follows. Each tissue-specific set of SAGE data can be used to generate a model for predicting PATC scores for C. elegans genes. These models "predict" that the PATC value will be the average score of all other genes with the corresponding tissue-specific SAGE value. If SAGE values for a given tissue are unrelated to PATC score, then the resulting tissue-specific model will be useless: no better than simply guessing the all-gene average in each case. By contrast, higher association between expression and PATC score will produce a more informative prediction based on the SAGE data. The information value of a given model [Shannon, 1948] can be assessed using tools from communications signal processing. Formulaically, the log likelihood ratio based on a given set of SAGE data is calculated as a weighted sum as follows:

$$\log_{10}(LR) = \sum_{g \in \mathrm{genes}} \frac{I_g}{I_a}\left[ F_g \log_{10}\frac{F_s}{F_a} + (1-F_g)\log_{10}\frac{1-F_s}{1-F_a} \right] = \sum_{g \in \mathrm{genes}} \frac{I_g}{I_a}\, X_g$$

where $X_g$ denotes the bracketed term for gene g, and

$I_g$ = Number of intronic bases in gene g

$I_a$ = Average number of intronic bases in all tested genes

$F_g$ = Fraction of intronic sequences in gene g with above-threshold PATC score (≥95)

$F_s$ = [Number of highly periodic (PATC score ≥95) intronic bases in all genes except g with an equivalent [*] SAGE signal to that seen with gene g] / [total number of intronic bases in these genes]

$F_a$ = Fraction of intronic sequences (in all genes) with above-threshold PATC score (≥95)

This calculation can also be described in terms of a Bayesian value expressing the relative likelihood of two different datasets (e.g., SAGE data) providing optimal prediction of a non-independent value (in this case the PATC≥95 fraction), or as a value related to a size-weighted cross-entropy (Kullback and Leibler, 1951).
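The weighted sum above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used for the published analysis: the `Gene` record and function name are hypothetical, "equivalent SAGE signal" is taken to mean an identical tag count (the pooling of sparse bins described in the footnote is not applied), genes with no peers are skipped, and the peer fraction is clamped away from 0 and 1 to keep the logarithms finite.

```python
import math
from collections import namedtuple

# Hypothetical per-gene record: intronic bases, intronic bases with PATC >= 95, SAGE tag count.
Gene = namedtuple("Gene", "intron_bases periodic_bases sage_count")


def log10_likelihood_ratio(genes, eps=1e-9):
    """Weighted log10 likelihood ratio of the SAGE-based model versus the all-gene average."""
    total_intron = sum(g.intron_bases for g in genes)
    total_periodic = sum(g.periodic_bases for g in genes)
    i_a = total_intron / len(genes)        # average intronic bases per gene
    f_a = total_periodic / total_intron    # all-gene periodic fraction (assumed strictly between 0 and 1)

    # Totals of intronic and periodic bases for each SAGE tag count.
    by_sage = {}
    for g in genes:
        totals = by_sage.setdefault(g.sage_count, [0, 0])
        totals[0] += g.intron_bases
        totals[1] += g.periodic_bases

    total = 0.0
    for g in genes:
        # Leave gene g out of its own SAGE bin when computing the peer fraction.
        peer_intron = by_sage[g.sage_count][0] - g.intron_bases
        peer_periodic = by_sage[g.sage_count][1] - g.periodic_bases
        if peer_intron == 0 or g.intron_bases == 0:
            continue  # assumption: genes without peers (or without introns) are skipped
        f_g = g.periodic_bases / g.intron_bases
        f_s = min(max(peer_periodic / peer_intron, eps), 1 - eps)  # assumption: clamp
        x_g = f_g * math.log10(f_s / f_a) + (1 - f_g) * math.log10((1 - f_s) / (1 - f_a))
        total += g.intron_bases / i_a * x_g
    return total
```

When every gene has the same periodic fraction the result is zero, since the SAGE-based prediction is then no better than the all-gene average; a positive value indicates that SAGE bins carry information about the PATC≥95 fraction.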

*These models would show some degree of unintended noise with SAGE data due to the volatility of PATC≥95 fractions for bins with large SAGE values and thus few items in each bin. To get around this, we stabilized the models for large values of the SAGE hit number (S) by averaging the PATC≥95 fraction for genes with adjacent values of S. Unless there are at least 10 genes with a similar S score, the algorithm does a running average by collecting genes with lower and higher S values until each category has at least 10 genes.

Comparisons of datasets using this algorithm make some assumption about equivalent numbers of observations in each dataset. Relationships between database size and observed likelihood ratio are complex, and to rule out substantial effects of this, we carried out a number of simulations in which each SAGE data set was artificially segmented into a number of random data subsets (we used 18000 SAGE tags as the size of the dataset segments). Each data subset was then used to calculate a separate likelihood ratio, yielding several independent assessments of association for each tissue. Data from distinct random subsets were then used to produce a sample-size independent mean and standard error for each of the likelihood measurements for different tissue sets. Although not using all of the data from the SAGE samples (and thus producing somewhat noisier signals), this "partition averaged" approach has the advantages of producing a sample-size independent measure of association and of yielding a rough value for potential variation due to random sampling in the SAGE datasets.
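The bin stabilization and the partition averaging described above might be sketched in Python as below. All names are hypothetical. Widening the S window symmetrically by one unit at a time is an assumption about how "lower and higher S values" are collected, and discarding the leftover tags that do not fill a complete subset is likewise an assumption.

```python
import random
import statistics


def pooled_fraction(genes, s, min_genes=10):
    """PATC>=95 fraction for SAGE value `s`, pooling neighboring S values if needed.

    `genes` is a list of (sage_count, intron_bases, periodic_bases) tuples.
    Returns (fraction, number_of_genes_pooled).
    """
    lowest = min(g[0] for g in genes)
    highest = max(g[0] for g in genes)
    k = 0
    while True:
        pool = [g for g in genes if abs(g[0] - s) <= k]
        # Stop once the bin is large enough or the window already covers every gene.
        if len(pool) >= min_genes or (s - k <= lowest and s + k >= highest):
            break
        k += 1  # assumption: widen symmetrically toward lower and higher S
    intron = sum(g[1] for g in pool)
    periodic = sum(g[2] for g in pool)
    return periodic / intron, len(pool)


def random_partitions(tags, size, seed=0):
    """Shuffle SAGE tags and cut them into equal subsets of `size` (remainder dropped)."""
    shuffled = list(tags)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i * size:(i + 1) * size] for i in range(len(shuffled) // size)]


def mean_and_sem(values):
    """Mean and standard error of the mean for per-subset likelihood ratios."""
    return statistics.mean(values), statistics.stdev(values) / len(values) ** 0.5
```

In use, each subset returned by `random_partitions(tags, 18000)` would be scored separately, and `mean_and_sem` applied to the resulting list of likelihood ratios.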
As expected for the roughly equivalent sizes of the SAGE datasets used, the partition-averaged analysis yielded similar results to those shown in Figure 6B (values are expressed as average likelihood ratio relative to null for datasets of 18000 SAGE tags partitioned at random from the full set; the ± value represents the standard error of this mean among all partitions tested [two independent partitionings of each dataset]).

Autosome Arms + Left tip of X: Oocytes 2.39±0.38, Embryos -0.63±0.16, Gut 0.56±0.52, Muscle -0.04±0.12, Hypodermis -0.52±0.09, Neurons(All) 1.16±0.30, Neurons(Ciliated) 0.36±0.20, Pharynx(All) 0.11±0.14, Neurons(AFD) 2.54±0.57, Pharynx(MarginalCells) 2.28±0.34.

Autosome Centers + Remainder of X: Oocytes 0.80±0.10, Embryos 0.17±0.06, Gut 0.07±0.05, Muscle -0.18±0.04, Hypodermis 0.10±0.08, Neurons(All) 0.58±0.11, Neurons(Ciliated) -0.19±0.05, Pharynx(All) -0.02±0.08, Neurons(AFD) -0.02±0.05, Pharynx(MarginalCells) 0.18±0.05.

Supplemental References:

Blumenthal, T., and K. Gleason, 2003 Caenorhabditis elegans operons: form and function. Nat. Rev. Genet. 4:112-120.

Kullback, S., and R. Leibler, 1951 On information and sufficiency. Annals of Mathematical Statistics 22:79-86.

Shannon, C., 1948 A mathematical theory of communication. The Bell System Technical Journal 27:379–423; 623–656.

Stinchcomb, D., J. Shaw, S. Carr, and D. Hirsh, 1985 Extrachromosomal DNA transformation of Caenorhabditis elegans. Mol. Cell. Biol. 5:3484-3496.

Figure S1. Periodicities of individual tetranucleotides in the C. elegans genome. Data were obtained using methods and datasets identical to those of Figure 3. Panel A shows plots of coincidence frequency versus separation (obtained as in Figure 3A) for all 256 tetranucleotides in the complete C. elegans genome. Panel B shows a similar analysis for a repeat-subtracted and coding region-depleted version of the C. elegans genome (all repeated sequences of ≥25 nt and all protein coding sequences are ignored, as in Figure 3). Complementary tetranucleotides (e.g., AAGG/CCTT) are shown in vertical mirror symmetry. Distinct coloring of vertical lines every 10.2 base pairs has been added to allow comparison with the helical repeat of DNA. The first vertical dark line in each graph corresponds to a separation of 10 bp. Three different types of normalization can be carried out for the graphs. First (and in all cases), the number of coincidences is divided by the total number of cases in which there is an opportunity for two tetranucleotides to be separated by n bases. For unmasked sequences, this is close to the length of the sequence. For masked sequences, this number varies as a function of the density of masked regions. Second (and for all curves shown), each graph has been scaled so that the maximum value for each histogram corresponds to the upper bound of the micropanel (this allows display of graphs with widely different maximal values on the same figure). To accentuate certain features, Panel C (complete genome) and Panel D (repeat and coding region-depleted genome) have been derived from Panels A and B respectively by a third normalization (termed "scaling"). For each graph in Panels C and D, the lower bound (smallest number of coincidences for a given tetranucleotide) has been arbitrarily set to the zero-point on the Y axis.
The latter normalization aids in identifying periodicities in the sequence but confounds somewhat the comparison of periodicities for different datasets or from different tetranucleotides. Thus the pre-scaled graphs (Panels A and B) provide a somewhat better means to compare datasets and tetranucleotides for periodicity, while scaled graphs (Panels C and D) provide better means to detect subtle periodicities.
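The coincidence counting and the normalizations described in this legend can be illustrated with a short Python sketch for a single unmasked sequence. This is not AllMotifPeriod.pas; the function names are hypothetical, and the "opportunity" count is taken to be the number of word-start position pairs separated by n bases, which applies only when nothing is masked.

```python
def separation_profile(seq, word, max_sep):
    """Coincidence frequency of `word` with itself at separations 1..max_sep.

    Returns {n: coincidences / opportunities} for an unmasked sequence.
    """
    n_starts = len(seq) - len(word) + 1
    positions = [i for i in range(n_starts) if seq.startswith(word, i)]
    occupied = set(positions)
    profile = {}
    for n in range(1, max_sep + 1):
        opportunities = n_starts - n  # pairs of start positions separated by n bases
        if opportunities <= 0:
            break
        coincidences = sum(1 for p in positions if p + n in occupied)
        profile[n] = coincidences / opportunities
    return profile


def display_scale(values, zero_at_min=False):
    """Scale a histogram so its maximum sits at 1.0 (the top of the micropanel).

    With zero_at_min=True the smallest value is also moved to 0.0, corresponding
    to the third normalization ("scaling") used for Panels C and D.
    """
    high = max(values)
    low = min(values) if zero_at_min else 0.0
    span = high - low
    if span == 0:
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]
```

As the legend notes, the `zero_at_min=True` form makes weak periodicities easier to see but distorts comparisons between tetranucleotides or datasets.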

Figure S2. Separation profiles between arbitrary tetranucleotides and the periodically-enriched tetranucleotide AAAA/TTTT. These profiles were prepared as for Figure S1 except that the profiles shown denote the profile of separations between distinct tetranucleotides. Panel A shows separation profiles between TTTT and a subsequent unique tetranucleotide. Panel B shows separation profiles between AAAA and a subsequent unique tetranucleotide. Complementary unique tetranucleotides are shown in upward-downward pairs to allow comparison. A strong lack of symmetry can be noted in many cases for the two paired graphs, due to the non-equivalence of motif pairs such as TTTT(X)nGGAA and TTTT(X)nTTCC. These profiles were prepared from unfiltered genome files obtained from wormbase [Spieth et al., 2005] as of May 2005.

Figure S3. Asymmetry in periodic distribution of AAAA/TTTT tetranucleotides in the C. elegans genome. Panel A expands two distinct separation profiles prepared from C. elegans genomic DNA. The blue graph shows distributions of separation distance between a TTTT word and a subsequent (downstream) AAAA word in the genome. The superimposed red graph shows distribution of separations between an AAAA word and subsequent TTTT word. Particularly notable are differences in peak positions between the red and blue curves (e.g., peaks at 10 [blue] and 11 bp [red]) and in the overall periodicity of the curves (e.g., 88-100bp and 140-250 bp). Panel B, as a reference to indicate magnitudes of differences due to stochastic fluctuations in genome orientation, shows histograms for AAAA(Xn-4)AAAA to TTTT(Xn-4)TTTT separations in the genome. Unlike the two curves in Panel A, those in Panel B represent complementary sequence arrangements; the curves in Panel B would thus be expected to correspond closely (as they do) with any fluctuations of a limited quantitative nature. These profiles were prepared from unfiltered genome files obtained from wormbase as of May 2005 and are normalized and scaled as described in Figure 2.

Figure S4. Long range periodicity measures of the C. elegans genome. Panel A shows the number of bases in the genome present in PATC islands of a specified length, with islands determined as in Figure 3A using a cutoff value of 95. Panels B, C, D, and E use a measurement of long range correlation described as follows. For each five-base word we define a periodicity contribution "Pc": Pc=-2 if there are no AA/TT dinucleotides, Pc=-1 if there is one AA/TT dinucleotide, Pc=1 if there are two AA/TT dinucleotides, and Pc=2 if there are three or four AA/TT dinucleotides. For each possible separation (n bases, X axis), a base-by-base sum over all points x in the genome is prepared:

$$\sum_{x} \mathrm{Pc}(x)\,\mathrm{Pc}(x-n)$$

For each n, the resulting sum is normalized by dividing by the number of individual cases in which two words of any type are separated by n base pairs. Panel B shows this sum prepared for all intron sequences which fall within the autosomal arms (i.e., excluding everything except the tip of the X chromosome and the arms of each autosome). Panel C shows a corresponding figure in which only sequences within a single intron are allowed to contribute to the histogram. Panel D shows a comparable plot in which only sequences in different introns within the same gene were allowed to contribute (thus requiring that at least one exon separate the two words to be assayed). Panel E shows a Fourier analysis of the data in Panel B. Values in Panel E were calculated according to the formula

$$V(B) = \Big[\sum_{n} \sin(2\pi n/B)\,\Omega(n)\Big]^2 + \Big[\sum_{n} \cos(2\pi n/B)\,\Omega(n)\Big]^2$$

where the sums run over all separations n, B is the separation being tested as a potential resonance [in this case, each multiple of 0.01 between 8 and 12 base pairs], n are the different separations (5-1280) for which the long-range correspondence was calculated, and Ω(n) is the correspondence value for n reported in Panel B. Although this analysis yields a "best fit" period of 10.06 bp (Figure S4E), it is important to note that the precise value of the best-fit period parameter is not an algorithm-independent feature of the underlying sequence. In particular, we note modest variations in best-fit period for several different correlation plots derived from C. elegans DNA using the different algorithms in this paper (e.g., similar Fourier analysis of Figure 2C generates a best-fit period of 10.20). The variability here presumably reflects (at least in part) a pseudo-periodic character to each of the plots, in particular a slight (but consistent) variability in individual peak-to-peak distances along the X axis for each of the plots. The "best fit" periodicity will thus represent an amalgam of these slightly different periodicities whose precise final value will depend on the weighting parameters used to generate the coincidence plot as well as on the exact sequence dataset used. Pseudo-periodic character is certainly not unexpected in such analysis: DNA wrapped around large objects with precise and reproducible local structure (such as the nucleosome) would be expected to experience slight but consistent local variations in exact helical density, and indeed this has been observed experimentally and inferred bioinformatically [e.g., Luger et al., 1997; Ioshikhes et al., 1996].
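A compact Python sketch of the Pc contribution, the normalized correlation Ω(n), and the resonance value V(B) from this legend is given below. The function names are hypothetical, the sketch treats its input as one contiguous unmasked sequence (no intron or chromosome-arm filtering), and the normalizer is simply the number of word pairs at each separation.

```python
import math

# Periodicity contribution of a 5-base word, keyed by its number of AA/TT dinucleotides.
PC_BY_COUNT = {0: -2, 1: -1, 2: 1, 3: 2, 4: 2}


def pc(word5):
    """Pc value of a five-base word."""
    count = sum(1 for i in range(4) if word5[i:i + 2] in ("AA", "TT"))
    return PC_BY_COUNT[count]


def correlation(seq, max_sep, min_sep=5):
    """Omega(n): mean of Pc(x) * Pc(x - n) over all word pairs separated by n bases."""
    values = [pc(seq[i:i + 5]) for i in range(len(seq) - 4)]
    omega = {}
    for n in range(min_sep, max_sep + 1):
        pairs = len(values) - n
        if pairs <= 0:
            break
        omega[n] = sum(values[x] * values[x - n] for x in range(n, len(values))) / pairs
    return omega


def resonance(omega, period):
    """V(B) for a candidate period B, summed over every separation present in omega."""
    s = sum(math.sin(2 * math.pi * n / period) * w for n, w in omega.items())
    c = sum(math.cos(2 * math.pi * n / period) * w for n, w in omega.items())
    return s * s + c * c


def best_period(omega, low=8.0, high=12.0, step=0.01):
    """Candidate period between `low` and `high` (in multiples of `step`) that maximizes V(B)."""
    count = int(round((high - low) / step)) + 1
    candidates = [round(low + i * step, 2) for i in range(count)]
    return max(candidates, key=lambda b: resonance(omega, b))
```

Because `best_period` only reports the grid point with the largest V(B), its answer inherits the dependence on weighting and dataset that the legend cautions about.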

Figure S5. List of investigator-named C. elegans genes sorted by periodic character. All gene models that were associated with investigator-assigned gene names are listed in reverse order of the number of bases with a PATC>95 score in intron sequences. A list of investigator-assigned gene names and wormbase-derived models was derived from wormbase in May 2005. Periodicity characteristics were determined for each gene as described in Figure 5B. Fractions shown are for periodicity based on the PATC algorithm of Figure 3, with a cutoff of 95. Intron, exon, 5' flanking, and 3' flanking sequences are analyzed in columns 3, 4, 5, and 6 respectively. Additional columns note other compiled parameters of each sequence using the PATC algorithm described in the software archive.

Figure S6. List of all C. elegans gene models with calculated periodicity characteristics. All gene models were analyzed as in Figure S5. Columns 1 and 2 are gene designations; columns 3, 4, 5, and 6 are intron, exon, 5' flanking, and 3' flanking scores, with sequence annotations as downloaded from wormbase. Note that many genes have been described by several gene models; in such cases, all models from the May 2005 version of wormbase are included in the table.

Figure S7. Annotation-based assessment of association between strong periodicity and germline function for C. elegans genes. A comparison of periodicity properties is shown among a subset of C. elegans genes for which specific biological functions had been assigned by classical genetic means. To first obtain a set of genes which had been characterized independently of genome-wide approaches, we focused on a list of genes which were first isolated using classical genetic screens. In most cases, extensive phenotypic and molecular analysis of these loci was carried out independently of whole-genome expression analysis tools that have been available in the last few years. Note that this list should be considered representative rather than complete; in particular the list omits a small number of characterized genes for which the relevant wormbase bulk-download entries carried a missing or incorrect genetic name or for which the genomic extent of the gene was clearly mis-annotated. Genes were separated on the basis of position within the genome into a strongly-periodic subset (upper part of the figure; roughly the terminal 1/3 of each autosome plus the left tip of X) and the remaining component of the genome (lower part of figure). At left are shown the highest ranked genes (in terms of intron periodicity) in each set. The apparently non-periodic group of genes in each genome partition is quite large (several thousand genes in each set show no detectable periodicity in introns, exons, or flanking DNA). To avoid any systematic bias in this group, we selected at random within the non-periodic set, listing in the figure only those genes that meet the criteria for classical genetic characterization described above. Although the expression and activity pattern for these 139 genes are not completely known, a considerable body of information from gene-specific investigations is available for each gene. 
This information, referenced through WormBase and the open-access literature articles therein, was scanned for indications of any of the following properties: (a) a mutant phenotype that appeared likely to reflect a need for germline expression of the wild-type gene (e.g., maternal effects on embryogenesis); (b) evidence for germline expression from antibody staining or in situ hybridization; (c) expression of transgene reporter constructs in germline tissue (note that such experiments frequently fail to show germline expression for unknown reasons, so negative results in transgene assays are not particularly indicative). The outcome is indicated by color (black: experimental data suggesting germline expression; red: lack of experimental data suggesting germline expression). Although all of these criteria have certain caveats and biases, they are among the best available with current technologies. We also note that our gene-by-gene, literature-based annotation of germline expression information was by nature imperfect. Of the 139 genes in the original scanned set, subsequent conversations with colleagues working on two of the genes (sex-1 [Barbara Meyer] and smu-2 [Robert Herman]) indicated that we had missed subtle points in the original literature. Given the nature of the analysis, this type of error is almost certain to lead to an underestimate of the number of germline-expressed genes in each set. Similarly, it should be stressed that the currently published analysis of any given gene is always expected to be incomplete; thus we would certainly expect that additional genes from these lists will eventually be shown to be expressed in germline tissue.

Figure S8. Periodicity profiles for 12 germline-expressed genes that have been investigated for possible adaptation to produce extrachromosomal expression vectors. For each of these genes, we show a schematic diagram of exons (wide bars) and introns (intervening narrow bars), annotated 5' and 3' UTR sequences, and 1 kb of upstream and downstream sequence annotated as nontranscribed (all annotations as of November 2005 in WormBase). Plots above each gene diagram show the fraction of bases within a given 50-base region showing above-threshold periodicity (PATC ≥95). Five of these six genetic regions (fem-1, rde-1, mes-1, glp-1, and dcr-1) were incapable of driving reporter expression in either simple tandem array transgenes [as described by Stinchcomb et al., 1995] or complex array transgenes [Kelly et al., 1997]. Although some functionality of each transgene construct was confirmed by rescue of the corresponding null mutation with the original (pre-tagging) DNA clone, there was no confirmation that this expression was in the germline. No germline reporter activity (GFP fluorescence) was observed for any of these five gfp-tagged constructs.
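
The per-region quantity plotted above each gene diagram (fraction of bases in a 50-base region with PATC ≥95) can be sketched as follows. This is a minimal illustration only, assuming that per-base PATC scores are already available as a list of numbers. The function name `fraction_above_threshold`, the choice of non-overlapping windows, and the toy scores are hypothetical and are not taken from the original analysis.

```python
from typing import List


def fraction_above_threshold(
    scores: List[float], window: int = 50, threshold: float = 95.0
) -> List[float]:
    """Return, for each consecutive window, the fraction of bases
    whose score is at or above the threshold.

    Assumption: windows are non-overlapping; a shorter final window
    is normalized by its own length.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    fractions = []
    for start in range(0, len(scores), window):
        chunk = scores[start:start + window]
        hits = sum(1 for s in chunk if s >= threshold)
        fractions.append(hits / len(chunk))
    return fractions


# Toy per-base scores (hypothetical values, not real PATC output):
# 25 high + 25 low bases, then 50 high bases, then 10 low bases.
example_scores = [96.0] * 25 + [10.0] * 25 + [99.0] * 50 + [0.0] * 10
example_fractions = fraction_above_threshold(example_scores)
print(example_fractions)
```

Non-overlapping windows keep the sketch short; a sliding window would give a smoother profile at the cost of more computation.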

Figure S9. Prediction of unusual "bent" character in the C. elegans genome with a wide variety of computational parameters. Each line in the graph shows the distribution of predicted bend angles for bases in the C. elegans genome [11/2002] as a function of specific parameters used in the UT87 algorithm (see Table 1). Blue lines are predictions from C. elegans sequence; red lines are from a random sequence of equivalent AT content. The Y axis shows the number of bases in the genome that the algorithm predicts will fall within a bent segment of the defined segment length. Two parameters were varied: the cutoff value (in terms of angle) used to define a sequence as bent (X axis; represented by the cosine of the angle), and the length of the segment window ('SEG'). Separate curves are shown for segment lengths of 31, 61, 121, and 241 for genomic and random sequence. For each segment length, a clear difference between C. elegans and random sequence is observed. The most dramatic differences are present with a medium-length segment (e.g., 61 bp), which is short enough to provide a low probability of random bends while being long enough to allow the unusual character of C. elegans DNA to be evident in the predicted 3-dimensional path.
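
The random-sequence control mentioned in this legend can be sketched as follows. This is only an illustration, assuming that "equivalent AT content" means each base is drawn independently with the same overall A+T fraction as the input sequence. The function names and the fixed seed are hypothetical, and the original analysis may have constructed its random sequence differently.

```python
import random


def at_fraction(seq: str) -> float:
    """Fraction of A/T among the unambiguous (A, C, G, T) bases of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return sum(1 for b in bases if b in "AT") / len(bases)


def random_sequence_matching_at(seq: str, length: int, seed: int = 0) -> str:
    """Draw a random sequence whose expected AT content matches seq.

    Assumption: bases are independent; A and T are equally likely
    within the AT class, as are G and C within the GC class.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    p_at = at_fraction(seq)
    out = []
    for _ in range(length):
        if rng.random() < p_at:
            out.append(rng.choice("AT"))
        else:
            out.append(rng.choice("GC"))
    return "".join(out)


# Toy input (hypothetical): two thirds of the bases are A or T.
control = random_sequence_matching_at("AATTGC", 1000)
print(len(control), round(at_fraction(control), 2))
```

A control built this way preserves base composition but not higher-order structure such as dinucleotide frequencies, which is why it serves as a baseline for bends expected from composition alone.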

Figure S-1A

Figure S-1B

Figure S-1C

Figure S-1D

Figure S-2A

Figure S-2B

Figure S-3A
[Plot: "Separation of An/Tn tracks (repeat-masked C. elegans)". X axis: Separation (base pairs), 10 to 250. Y axis: Relative Coincidence Frequency, 15 to 27. Series: Coincidence Frequency AAAA X(n-4) TTTT; Coincidence Frequency TTTT X(n-4) AAAA.]

Figure S-3B
[Plot: "Separation of An/Tn tracks (repeat-masked C. elegans)". X axis: Separation (base pairs), 10 to 250. Y axis: Relative Coincidence Frequency, 16 to 32. Series: Coincidence Frequency TTTT X(n-4) TTTT; Coincidence Frequency AAAA X(n-4) AAAA.]

Figure S-4A
[Plot: "Long-Range Periodicity: PATC Window Lengths". X axis: Window Size (base pairs), 50 to 650. Y axis: Bases with indicated PATC window size, 0 to 10000.]

Figure S-4B
[Plot: "Long Range Periodicity: Autosome Arms (All Intron Combinations)". X axis: Separation (base pairs), 0 to 650. Y axis: Coincidence Score (arbitrary units), 0.18 to 0.36.]

Figure S-4C
[Plot: "Long-Range Periodicity: Autosome Arms (Intra-Intron)". X axis: Separation (base pairs), 0 to 650. Y axis: Coincidence Score (arbitrary units), 0.18 to 0.36.]

Figure S-4D
[Plot: "Long-Range Periodicity: Autosome Arms (Inter-Intron)". X axis: Separation (base pairs), 50 to 650. Y axis: Coincidence Score (arbitrary units), 0.00 to 0.20.]

Figure S-4E
[Plot: "Periodic Character of C. elegans introns: Fourier Analysis". X axis: Frequency (base pairs), 8.0 to 12.0. Y axis: Fourier Value (arbitrary units), 0 to 800.]

Figure S-7
Column labels: Control group: random subset of genes with P=0; Periodic group: genes with highest PATC score
Row labels: Strong periodic regions (Autosome arms); Weakly periodic regions (Autosome ctrs + X)

zyg-12 mig-5 rme-1 ced-13 tra-2 unc-84 fem-1 mut-7 sex-1
elt-3 pbo-4 cup-4 rol-6 egl-47 sur-7 lon-1 egl-13 mec-6 mec-4
mec-7 sma-3 daf-6 vab-15 unc-29 let-413 unc-10 mab-21 sma-5 dpy-8
slt-1 unc-38 egl-46 sem-4 elt-2 mup-2 ram-5 unc-129
unc-11 pro-1 tab-1 lin-36 unc-57 osm-5 daf-16 sel-5
tam-1 evl-14 mog-5 aph-1 ama-1 mom-4 gro-1
ced-6 mat-2 mes-3 sid-1 zyg-9 fem-3 clk-1
mom-5 dna-2 mau-2 aph-2 let-502 sos-1 mel-26
par-2 smu-2 mrt-2 mel-46 ced-10 sqv-2 hmp-2 ced-2 apx-1 sqv-6
ksr-2 dpy-28 par-4 par-6 fog-1 zyg-8 pop-1 smg-2 unc-45
pie-1 sur-2 unc-60 ced-9 mog-4 cul-2 fem-2 dpy-21 unc-59
ced-1 unc-101 unc-33 rol-1 ben-1 lin-7 unc-34 unc-26 lin-42
cat-4 aex-3 unc-112 egl-1 unc-47 pha-2 let-70 act-2 sel-10 ceh-22
egl-30 mab-7 daf-3 lin-25 sqt-2 daf-5 daf-1 gon-1 gon-4
hlh-1 ces-2 pat-6 des-2 nob-1 egl-9 egl-26 epi-1 osm-6
sqv-4 emo-1 gut-2 mep-1

fem-1 rde-1 ama-1 let-858

gna-2 mes-1 pie-1 smu-2

glp-1 dcr-1 mex-3 smu-1

Figure S-8

Figure S-9
[Plot: "Predicted bending as a function of segment length". X axis: Bending Metric [=Cos(Ø)], -1 to 1. Y axis: Bases in genome, 100 to 10000000 in tenfold steps. Series: C. elegans DNA and Random DNA, each at Seg=241, 121, 61, and 31. Inset: DNA path of 31, 61, 121, or 241 nt with angle Ø; Ø=0° No Bend, Ø=180° Maximum Bend.]

