
Supporting Information
Opperman et al. 10.1073/pnas.0805946105

SI Methods

Library Construction. Meloidogyne hapla eggs were harvested from roots of tomato plants and DNA was isolated by using standard protocols. The one library sequenced at the Center for the Biology of Nematode Parasitism (CBNP), CHOHAP, was made by Orion. The CHOHAP library, made from sheared genomic DNA ligated into a modified pBluescript vector (Stratagene), had an average insert size of 3 kb. The FASU library, made from sheared DNA ligated into the vector pUC18, had an average insert size of 3 kb. The libraries FGHA and FGHB, made with sheared DNA ligated into the vector pMCL200, had average insert sizes of 6 and 8 kb, respectively. Two Fosmid libraries with average insert sizes of ~40 kb were made by ligating sheared genomic DNA into the vector pCC1Fos. Details of library production can be found at http://www.jgi.doe.gov/sequencing/protocols/index.html.

Sequence Production. All sequencing was performed by using standard protocols on either ABI 3730 or MegaBACE sequence analyzers. See supporting information (SI) Table S1 for a breakdown of sequence statistics.

Sequence Analysis. Sequences were made available by the Department of Energy Joint Genome Institute (JGI) in SCF format. Quality scores were automatically generated at JGI by using either KB basecaller version 1.2 or Phred version 0.020425.c, depending on whether sequences were analyzed with an ABI 3730 or a MegaBACE sequencer, respectively. All sequencing at CBNP was performed on ABI 3730 DNA analyzers, and quality scores were generated by KB version 1.2. FASTA and quality score files were extracted from SCF files by using Phred.

After extraction of FASTA and quality files we used a series of tools to prepare the data for assembly. First, we used Crossmatch run on the Sun Grid Engine to compare all of the sequences with the cloning vectors, host Escherichia coli genome, and 14 complete nematode mitochondrial genomes available on the National Center for Biotechnology Information (NCBI) database. Accession numbers for the 14 mitochondrial genomes are AJ417718, X54253, NC_004298, AC186293, AC186293, NC_004806, AJ537512, AJ417719, AJ556134, NC_001861, AY591323, AY591323, NC_002681, and NC_002681. Vector sequences were trimmed by using an algorithm developed at the CBNP. Sequences matching either mitochondrial or E. coli sequences were removed from our dataset. Sequences of <10 bp left after trimming were eliminated from further analysis.

A series of ad hoc Perl scripts was used to prepare the data for Arachne. This included renaming individual sequences to give each one a well ID, to indicate the sequencing primer used, and to provide insert size data for Arachne. Sequence data were assembled using Arachne 2.0.1 on a Dell workstation running Red Hat Enterprise Linux 5.0 with 16 GB of total RAM. The data were assembled by using a series of different parameters in an attempt to optimize assembly (data not shown). The current build included many of the parameters reported to enhance supercontig size and accuracy. We used the Arachne option remove_tiny_supers to eliminate any supercontigs <2,000 bp in length.
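The two length filters mentioned in this section (discarding very short reads after trimming and discarding short supercontigs) can be illustrated with a minimal Python sketch. It is not the CBNP trimming algorithm or Arachne's own option; the function names `read_fasta` and `drop_short` are hypothetical, and the sketch assumes plain FASTA text as input.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Use the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def drop_short(records, min_len):
    """Keep only records whose sequence is at least min_len bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


if __name__ == "__main__":
    demo = read_fasta(">read1\nACGTACGTACGT\n>read2\nACG\n")
    # With min_len=10, the 3-base read is discarded.
    print(drop_short(demo, 10))
```

Calling `drop_short(records, 2000)` on assembled supercontigs would apply the same idea at the 2,000-bp threshold.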

RepeatMasker. The assembled contig sequences were processed with RepeatMasker version 3.0.8 (1) with the cross-match engine, version 0.990329, using the RepBase Update repeat library, version 9.04, to identify any Transposable Elements (TEs). The options used were: poly, gff, ace, excln, and species. The “species” option was set to “elegans.”

RepeatScout. The assembled contig sequences were processed with RepeatScout version 1.0.2 by using a standard protocol for the processing of data with RepeatScout that was developed in-house from the RepeatScout documentation. Step five of the protocol was repeated five times by using the threshold values of 10 (default), 20, 30, 40, and 50, and we did not use step six of the program.

Glimmer. PASA (Program to Assemble Spliced Alignments) was run by using a combination of M. hapla genomic sequence and tylenchid EST data to produce a training set for gene modeling (2). ESTs with alignments ≥90% identity and 90% coverage with proper exon/intron boundaries were used to create EST assemblies. After validation, there were 37,605 EST alignments that assembled into 5,887 assemblies representing 5,477 loci.

Putative full-length assemblies with ORFs ≥100 aa and more than one exon were used as the training set for Glimmer. The gff file containing the PASA gene predictions was parsed for exon start and end positions for each gene. The exon positions and genome assembly were used as a training set for GlimmerHMM 3.0.1 (3). The trained data were then used to scan the entire assembly for predicted genes.
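The step of parsing exon start and end positions per gene from a gff file might look like the following Python sketch. The exact layout of the PASA gff file is not given here, so the sketch assumes a nine-column, tab-delimited GFF layout in which exon rows carry a `Parent=` attribute naming the gene; the function name `exons_by_gene` is hypothetical.

```python
from collections import defaultdict


def exons_by_gene(gff_text):
    """Return {gene_id: [(start, end), ...]} with exons sorted by start.

    Assumes nine tab-separated GFF columns and a 'Parent=' attribute on
    exon rows (an assumption; the real PASA output may differ).
    """
    exons = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        gene = attrs.get("Parent")
        if gene is None:
            continue
        exons[gene].append((int(cols[3]), int(cols[4])))
    return {gene: sorted(spans) for gene, spans in exons.items()}


if __name__ == "__main__":
    demo = (
        "contig1\tPASA\texon\t300\t450\t.\t+\t.\tID=e2;Parent=g1\n"
        "contig1\tPASA\texon\t100\t200\t.\t+\t.\tID=e1;Parent=g1\n"
    )
    print(exons_by_gene(demo))
```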

Intron/Exon Statistics. Gene structure statistics were compiled by using 3,126 gene models that were unequivocally validated by using full-length EST assembly data. A Perl script was written to parse the gene model file produced by PASA. This script gathered statistics regarding the number and size of exons and introns for each of the tier 1 genes. The results were read into R and summary statistics were obtained.
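As an illustration of the kind of statistics described, the Python sketch below derives exon counts, exon lengths, and intron lengths from per-gene exon coordinates. It stands in for the Perl and R steps rather than reproducing them; the function name `gene_structure_stats` is hypothetical, and coordinates are assumed to be 1-based and inclusive, as in GFF.

```python
from statistics import mean, median


def summarize(values):
    """Small summary of a list of numbers, or None for an empty list."""
    if not values:
        return None
    return {"n": len(values), "mean": mean(values), "median": median(values)}


def gene_structure_stats(exons_by_gene):
    """Summarize exons per gene, exon lengths, and intron lengths.

    exons_by_gene maps a gene ID to a list of (start, end) exon spans,
    assumed 1-based and inclusive.
    """
    exon_counts, exon_lengths, intron_lengths = [], [], []
    for spans in exons_by_gene.values():
        spans = sorted(spans)
        exon_counts.append(len(spans))
        exon_lengths.extend(end - start + 1 for start, end in spans)
        # An intron is the gap between two consecutive exons.
        intron_lengths.extend(
            nxt[0] - prev[1] - 1 for prev, nxt in zip(spans, spans[1:])
        )
    return {
        "exons_per_gene": summarize(exon_counts),
        "exon_length": summarize(exon_lengths),
        "intron_length": summarize(intron_lengths),
    }


if __name__ == "__main__":
    print(gene_structure_stats({"g1": [(100, 200), (300, 450)], "g2": [(10, 19)]}))
```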

Merging M. hapla Genomic Sequence with Genetic Map. Integration of the genetic map with the draft sequence assembly was performed by comparing the sequences of the amplified fragment length polymorphism (AFLP) markers to the genomic contigs. The sequences of 30 AFLP markers were compared with the contigs from the latest 10× draft sequence by using BlastN. Marker sequences were used as query sequences, and true contigs from the 10× draft assembly of the genomic sequence were used as target sequences. Low-complexity filtering was turned off for this blast job. An E value of E-10 was used as a cutoff. Up to 100 matches to the genomic sequence were returned.
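The cutoff logic described above (an E-value threshold and a cap of 100 matches per marker) can be sketched in Python as a post-filter on tabular BLAST output. This is an illustration only: the 12-column tabular layout with the E value in column 11, the best-first ordering of hits, and the use of "less than or equal to" for the cutoff are all assumptions, and `filter_hits` is a hypothetical name.

```python
def filter_hits(tabular_text, max_evalue=1e-10, max_hits=100):
    """Keep up to max_hits (subject, evalue) pairs per query at or below max_evalue.

    Assumes tab-separated rows with query in column 1, subject in column 2,
    and E value in column 11, listed best-first for each query.
    """
    kept = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject, evalue = cols[0], cols[1], float(cols[10])
        if evalue <= max_evalue and len(kept.get(query, [])) < max_hits:
            kept.setdefault(query, []).append((subject, evalue))
    return kept


if __name__ == "__main__":
    rows = "\n".join([
        "marker1\tcontig5\t100\t50\t0\t0\t1\t50\t1\t50\t1e-20\t95",
        "marker1\tcontig9\t90\t50\t5\t0\t1\t50\t1\t50\t1e-5\t40",
    ])
    print(filter_hits(rows))
```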

Operons and Synteny. A comparison was done between known operons in Caenorhabditis elegans and the M. hapla genomic dataset. A WormMart query provided 4,685 C. elegans proteins representing genes found in operons across the C. elegans genome. These proteins were used as queries in a GeneDetective search of the contigs from the M. hapla genomic assembly with a significance of 1.0E-05. Perl scripts were used to identify multiple members of C. elegans operons found on the same contig in the M. hapla assembly.
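The final grouping step, finding operons with more than one member matched on the same contig, can be sketched in Python as follows. The input structures and the name `colocated_operon_members` are hypothetical, and the gene, operon, and contig labels in the example are placeholders rather than results from this study.

```python
from collections import defaultdict


def colocated_operon_members(gene_to_operon, hits, min_members=2):
    """Group matched genes by (operon, contig) and keep groups with enough members.

    gene_to_operon: {gene_name: operon_id}
    hits: iterable of (gene_name, contig_id) pairs that passed the search cutoff
    """
    found = defaultdict(set)
    for gene, contig in hits:
        operon = gene_to_operon.get(gene)
        if operon is not None:
            found[(operon, contig)].add(gene)
    return {
        key: sorted(genes)
        for key, genes in found.items()
        if len(genes) >= min_members
    }


if __name__ == "__main__":
    operons = {"geneA": "OP1", "geneB": "OP1", "geneC": "OP2"}
    matches = [("geneA", "contig7"), ("geneB", "contig7"), ("geneC", "contig9")]
    # Only OP1 has two members on one contig.
    print(colocated_operon_members(operons, matches))
```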

Horizontal Gene Transfer. Genes from the Initial Freeze were used as queries in multiple homology searches by using the filtering methods described in Scholl et al. (4). Results of the filtering steps were manually curated for final analysis.

Opperman et al. www.pnas.org/cgi/content/short/0805946105 1 of 14


Identification of Known Secretion Genes. A total of 80 sequences were obtained from GenBank by searching the literature for reported secreted proteins from various plant-parasitic nematode species. Some redundancy may exist in the list identified, as multiple family members are listed for some genes. Redundancy was limited by using sequences from the most closely related species when genes were found described in several different plant-parasitic nematode species. Sequences were translated in six frames and compared with the entire set of 3,452 genomic contigs, also translated in six frames, by using Blast. A match was returned if the E value was less than E-03. The relatively high E value cutoff was used to allow identification of the most complete set of putative secreted proteins in M. hapla.
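For readers unfamiliar with six-frame translation, the following Python sketch shows what it produces for a short DNA string: three reading frames on the forward strand and three on the reverse complement. It assumes the standard genetic code, which is not stated in the text, and the names `translate` and `six_frames` are hypothetical; Blast performs this step internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations keyed '+1', '+2', '+3', '-1', '-2', '-3'."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return {
        f"{sign}{offset + 1}": translate(strand[offset:])
        for sign, strand in (("+", forward), ("-", reverse))
        for offset in range(3)
    }


if __name__ == "__main__":
    for frame, protein in six_frames("ATGGCCTAA").items():
        print(frame, protein)
```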

Identification of Putative Secreted Genes de Novo. For these analyses we used peptide sequences from HapPep3. First, we predicted putative signal sequences by using SignalP. We used two available methods to predict signal sequences, SignalP-NN (neural networks) and SignalP-HMM (using hidden Markov models). Only predicted proteins that had a putative signal sequence by both methods were considered likely candidates. To exclude transmembrane proteins, we used the membrane topology prediction program TMHMM. Proteins predicted to be secreted were then compared with the proteins in WormPep version 185. All sequences that did not have a match to C. elegans proteins were then compared with the NCBI nonredundant protein dataset NR using BlastX to assign annotations to those predicted proteins. All Blast analyses required an E value of less than E-10 to be considered a significant match.
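The candidate selection rule in this paragraph reduces to set operations: keep proteins flagged by both signal-peptide methods, then remove those predicted to be transmembrane. A minimal Python sketch, assuming the prediction results have already been parsed into collections of protein IDs (the function name and the IDs are hypothetical):

```python
def secreted_candidates(nn_positive, hmm_positive, transmembrane):
    """Proteins with a signal peptide by both methods and no transmembrane call."""
    return (set(nn_positive) & set(hmm_positive)) - set(transmembrane)


if __name__ == "__main__":
    nn = ["p1", "p2", "p3"]    # hypothetical IDs positive by the NN method
    hmm = ["p2", "p3", "p4"]   # hypothetical IDs positive by the HMM method
    tm = ["p3"]                # hypothetical IDs predicted transmembrane
    print(sorted(secreted_candidates(nn, hmm, tm)))
```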

G Protein-Coupled Receptors. C. elegans G-Protein Coupled Receptor (GPCR) protein sequences were collected from http://www.gpcr.org/7tm/. Sequence data were manually obtained by following a trail of links for each entry until FASTA resources could be found. Entries with the following SWISS-PROT identifiers were skipped, because their FASTA sequences could not be found at the time of sequence collection: O45338_CAEEL, O45339_CAEEL, O45967_CAEEL, O45984_CAEEL, O62076_CAEEL, O62506_CAEEL, P91439_CAEEL, Q18687_CAEEL, Q21690_CAEEL, Q23265_CAEEL, Q3Y401_CAEEL, Q7M3K7_CAEEL, Q7M3K9_CAEEL, Q8MTW9_CAEEL, Q963E7_CAEEL, Q965H7_CAEEL, YWO4_CAEEL. The C. elegans GPCR protein sequence dataset contained 1,011 sequences.

The C. elegans GPCR protein sequences were used as queries in a tera-blastp search against the M. hapla gene freeze translated in all six frames at E values of 10E-05, 10E-10, 10E-15, and 10E-20, with and without the low-complexity filter. The resulting BLAST result sets were imported into a spreadsheet where the number of top-hits and non-hits were tabulated. Protein names of top- and non-hits were placed into categories based on common keywords. Any protein name lacking descriptive text or containing the phrase “Temporarily Assigned Gene name family member” was placed into the “UNCHARACTERIZED/OTHER” category.

Nuclear Hormone Receptors. C. elegans Nuclear Hormone Receptor (NHR) protein sequences were collected from http://www.ncbi.nlm.nih.gov/. Duplicate entries and entries representing proteins from nonunique genomic locations were eliminated. The C. elegans NHR protein sequence dataset, which contained 170 sequences, was used in a tera-blastp against the M. hapla gene freeze and processed as described for the GPCRs above.

Collagens. C. elegans collagen protein sequences were collected from http://www.wormbase.org/. After duplicate entries and entries from nonunique genomic locations were eliminated, the C. elegans collagen protein sequence dataset contained 182 sequences. This dataset was further reduced by removing all sequences that did not contain a minimum of one instance of a (Gly-X-Y)3 motif. Manually removing these entries resulted in a set of 165 C. elegans collagen protein sequences.

A Perl script was written to count the sequences containing at least one uninterrupted (Gly-X-Y)n motif, where Gly represents glycine, X represents any amino acid (including glycine), and Y represents any amino acid except glycine. The restriction on Y was imposed to avoid finding sequences with only a very low complexity glycine repeat (i.e., GGGGGGGGG). The number of sequences meeting this criterion was tabulated for three protein sets: the C. elegans protein set from http://www.sanger.ac.uk/Projects/C_elegans/Science98/, October 1998 (19,099 proteins total); the C. elegans collagen protein set from Wormbase (http://www.wormbase.org/), cross-validated (165 proteins total); and M. hapla HapPepf1v1 (13,336 proteins total). Three lines of evidence were used to estimate the true number of M. hapla collagen proteins. First, we found the n values for which all (or almost all) proteins from the C. elegans reference set of 165 met the cutoff, together with the corresponding number of M. hapla proteins that met the cutoff. Second, a manual (visual) screen was performed using n ∈ {4, 5, 6, 7}, partitioning the proteins into “true collagens” and “questionable/false collagens.” The C. elegans collagen protein set of 165 was used as a reference for the visual comparison. Third, tera-blastp was used with the C. elegans reference set of 165 collagen proteins as the query, using E values of E-05, E-10, E-15, E-20, and E-25, with and without a low-complexity filter on. Hits from selected resulting sets were compared against the manually screened set of M. hapla proteins that had initially been filtered by using the “minimum of one instance of an uninterrupted (Gly-X-Y)n motif” criterion. Those proteins identified by tera-blastp which were not identified by the manual screen were compared. The combined methods were used to arrive at a confident number and list of true M. hapla collagen proteins from HapPep.
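The motif criterion described above translates naturally into a regular expression. The Python sketch below is an illustrative stand-in for the Perl script, not a copy of it: it assumes one-letter amino acid codes, treats any uppercase letter as a residue, and uses hypothetical function names.

```python
import re


def gxy_pattern(n):
    """Regex for n uninterrupted Gly-X-Y triplets.

    X is any residue (glycine allowed); Y is any residue except glycine,
    which keeps pure glycine runs such as GGGGGGGGG from matching.
    """
    return re.compile(r"(?:G[A-Z][A-FH-Z]){%d}" % n)


def count_with_motif(proteins, n):
    """Count sequences containing at least one uninterrupted (Gly-X-Y)n run."""
    pattern = gxy_pattern(n)
    return sum(1 for seq in proteins.values() if pattern.search(seq.upper()))


def counts_by_n(proteins, n_values):
    """Tabulate the number of matching sequences for each repeat length n."""
    return {n: count_with_motif(proteins, n) for n in n_values}


if __name__ == "__main__":
    demo = {
        "collagen_like": "MKGPPGAPGKPGEA",  # four consecutive Gly-X-Y triplets
        "poly_gly": "GGGGGGGGG",            # low-complexity glycine run
        "unrelated": "MSTNPKPQRK",
    }
    print(counts_by_n(demo, range(1, 6)))
```

Running the same tabulation on a reference collagen set and on the full protein set shows how quickly the count falls as n increases, which is the comparison the text uses to choose a cutoff.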

GO Analysis. Each gene was used as a query in a GeneDetective search against the Uniprot (swissprot + trembl) protein database. The top 10 results with a significance of 1.0E-15 were reported. A script was then used to query a database containing associations between Uniprot identifiers and GO numbers. The script looks for a GO association for the best match to Uniprot. If it does not find one, it continues to the next match, until it finds the best available match to Uniprot that has an associated GO number. The GO numbers are then used to re-create the GO tree, including counts for each of the categories traversing the tree.
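The fallback lookup and the tree counts described above can be sketched in Python as follows. The data structures are assumptions (ranked hit lists, a dictionary of GO associations, and a child-to-parents map), the function names are hypothetical, and the identifiers in the example are placeholders rather than real Uniprot or GO entries. The sketch counts each gene once per category, which is one possible reading of the counting rule.

```python
from collections import Counter


def first_go_annotation(ranked_hits, go_by_uniprot):
    """Return (hit_id, terms) for the best-ranked hit that has GO terms."""
    for uniprot_id in ranked_hits:
        terms = go_by_uniprot.get(uniprot_id)
        if terms:
            return uniprot_id, terms
    return None, []


def ancestors(term, parents):
    """All terms reachable from term by following child-to-parent links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def tree_counts(gene_terms, parents):
    """Count genes per category, crediting every ancestor of each assigned term."""
    counts = Counter()
    for terms in gene_terms.values():
        covered = set()
        for term in terms:
            covered.add(term)
            covered |= ancestors(term, parents)
        counts.update(covered)
    return counts


if __name__ == "__main__":
    parents = {"leaf1": ["mid"], "leaf2": ["mid"], "mid": ["root"]}
    gene_terms = {"g1": ["leaf1"], "g2": ["leaf2"], "g3": ["mid"]}
    print(tree_counts(gene_terms, parents))
```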

Protein Domain Analysis. A protein dataset was constructed from the M. hapla Freeze. These proteins were used as queries in an HMM search against the Pfam22 database. The top 10 matches with a significance of E-05 were reported back. The top match for each query was used to examine the most common protein domains identified in the M. hapla protein dataset. The same analysis was repeated by using wormpep185 for comparison. In addition, by using the top 10 matches, a distribution of the number of domains identified in each protein sequence was acquired.
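Both tallies in this paragraph (most common top-scoring domain, and number of domains per protein) are simple counts once the search output has been parsed. The Python sketch below assumes a dictionary mapping each protein to its reported domain matches, best first and at most 10 per protein; the function name and the example labels are hypothetical, and counting distinct domain names per protein is an assumption about how the distribution was built.

```python
from collections import Counter


def domain_summaries(matches):
    """Return (top-domain counts, distribution of distinct domains per protein).

    matches: {protein_id: [domain_name, ...]} ordered best-first.
    """
    top_domain_counts = Counter(hits[0] for hits in matches.values() if hits)
    domains_per_protein = Counter(len(set(hits)) for hits in matches.values())
    return top_domain_counts, domains_per_protein


if __name__ == "__main__":
    demo = {
        "prot1": ["domA", "domB"],
        "prot2": ["domA"],
        "prot3": [],
    }
    top, per_protein = domain_summaries(demo)
    print(top.most_common(1))
    print(sorted(per_protein.items()))
```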

1. Smit AFA, Hubley R, Green P (2004) RepeatMasker Open-3.0. Available at http://www.repeatmasker.org.

2. Haas BJ, et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31:5654–5666.

3. Salzberg SL, Pertea M, Delcher AL, Gardner MJ, Tettelin H (1999) Interpolated Markov models for eukaryotic gene finding. Genomics 59:24–31.

4. Scholl EH, Thorne JL, McCarter JP, Bird DMcK (2003) Horizontally transferred genes in plant-parasitic nematodes: A high-throughput genomic approach. Genome Biol 4:R39.1–R39.12.


Fig. S1. Repetitive elements in the M. hapla genome consist mostly of low-complexity or simple repeats. Less than 1% of the genome is classified as other, including mobile elements.


Fig. S2. G protein-coupled receptors are greatly reduced in M. hapla compared with C. elegans. Many have unknown function, and others encode typical nervous system components involved in signal transduction.


[Figure S3 image not reproduced. Tracks shown: M. hapla genomic contig 344 (scale 1k to 5k); M. hapla predicted protein; M. hapla EST unigenes; NCBI protein matches AAR37374.1 (cellulase, Meloidogyne incognita) and AAC48327.1 (beta-1,4-endoglucanase-1 precursor, Heterodera glycines).]

Fig. S3. Cellulase genes in M. hapla are clustered. In this case, two cellulase genes are in close proximity on the same M. hapla contig.


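[Figure S4 image not reproduced. Labels recovered from the panels: (A) C. elegans operon CEOP5328 with gene labels dpy-30, ZK836.2, and rnp-1; M. hapla contig 348 with Mh dpy-30 and Mh rnp-1. (B) C. elegans chromosome III, 5,254,429 – 8,326,497, with operon labels CEOP3272, CEOP3424, and CEOP3508 and genes Ce gop-1, gop-2, gop-3, hap-1, and gro-1; M. hapla contig 116, 25,482 - 62,591 (M. hapla operons) and 25,482 - 37,366 (M. hapla operon), with Mh gop-1, Mh gop-2, Mh gop-3, and Mh hap-1; Mh gro-1 on contig 840. (C) C. elegans operon CEOP1388 with F13G3.3, dylt-1, ttx-7, and F13G3.6; M. hapla contig 809, with positions labeled Mh F13G3.3, Mh dylt-1, Mh ttx-7, and Mh F13G3.6, three of them marked MISSING.]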

Fig. S4. Comparison of C. elegans and M. hapla genome organization reveals conserved macro- and microsynteny. (A) Gene order of dpy-30 and rnp-1 in C. elegans operon CEOP5328 is conserved in M. hapla, although Mh-dpy-30 has two additional introns compared to Ce-dpy-30, and apparent exon shuffling has moved the last exon of Mh-dpy-30 to within an Mh-rnp-1 intron. (B) Three operons from a 3-Mb span of C. elegans chromosome III have at least partial conservation on a 37-kb span of M. hapla contig 116. Four of the five genes in CEOP3272 have conserved order in M. hapla (dark blue) with the fifth found on a separate contig (light blue). Two genes of C. elegans operon CEOP3414 are conserved with M. hapla, and operon CEOP3508 is completely conserved. (C) Operon loss demonstrated by CEOP1388. Three of the four genes appear absent from the M. hapla genome.


Table S1. Whole-genome shotgun sequencing of M. hapla

Library  Sequencing  Insert    Total      Manual   Arachne excluded  Arachne excluded  Total     Total remaining
         center      size, kb  reads      screen*  by quality†       by prescreen‡     excluded  sequences for assembly
CHOHAP   CBNP        3         96,383     2        56,936            289               57,227    39,156
FASU     JGI         3         423,914    1,734    34,431            10,099            46,264    377,650
FGHA     JGI         6         410,998    14,800   26,786            27,723            69,309    347,689
FGHB     JGI         8         12,235     394      965               1,114             2,473     9,762
FGHC     JGI         40        39,168     2,003    4,944             520               7,467     31,701
FGIZ     JGI         40        30,693     5,182    3,857             187               9,226     21,467
Total                          1,013,391  24,115   127,919           39,932            191,966   827,425

*Raw sequences were run through a series of software tools to remove contaminating sequences. See text for details.
†Sequences excluded based on quality by using algorithm used by Arachne.
‡Sequences excluded based on criteria provided to Arachne config files.


Table S2. Repetitive element summary for M. hapla

file name: Mh10 g200708.Contigs.fas
sequences: 3,452
total length: 53,017,507 bp (53,017,507 bp excluding N-runs)
GC level: 27.40%
bases masked: 9,723,165 bp (18.34%)

                                      No. of    Length        Percentage of
                                      elements  occupied, bp  sequence, %
Retroelements                         204       66,764        0.13
  SINEs:                              0         0             0.00
  Penelope                            0         0             0.00
  LINEs:                              87        11,469        0.02
    CRE/SLACS                         0         0             0.00
    L2/CR1/Rex                        87        11,469        0.02
    R1/LOA/Jockey                     0         0             0.00
    R2/R4/NeSL                        0         0             0.00
    RTE/Bov-B                         0         0             0.00
    L1/CIN4                           0         0             0.00
  LTR elements:                       117       55,295        0.10
    BEL/Pao                           35        3,202         0.01
    Ty1/Copia                         0         0             0.00
    Gypsy/DIRS1                       82        52,093        0.10
      Retroviral                      0         0             0.00
DNA transposons                       329       33,653        0.06
  hobo-Activator                      2         343           0.00
  Tc1-IS630-Pogo                      323       32,877        0.06
  En-Spm                              0         0             0.00
  MuDR-IS905                          0         0             0.00
  PiggyBac                            1         88            0.00
  Tourist/Harbinger                   1         109           0.00
  Other (Mirage, P-element, Transib)  0         0             0.00
Rolling circles                       0         0             0.00
Unclassified:                         0         0             0.00
Total interspersed repeats                      100,417       0.19
Small RNA:                            27        1,960         0.00
Satellites:                           53        9,984         0.02
Simple repeats:                       6,258     500,192       0.94
Low complexity:                       155,054   9,104,527     17.17

Most repeats fragmented by insertions or deletions have been counted as one element. Runs of ≥20 Ns in query were excluded in % calculated. The query species was assumed to be Caenorhabditis.
RepeatMasker version open-3.0.8, default mode. Run with cross_match version 0.990329. RepBase Update 9.04, RM database version 20040702.


Table S3. Known secretion genes found in the M. hapla genome

GenBank accession no. Description

DQ841123 M. hapla parasitism protein 16D10AY861685 M. incognita pectate lyase 3 (pel3)AJ557572 M. incognita putative cathepsin L protease (cpl-1 gene)AY515703 M. incognita pectate lyase 2 (pel2)AY515702 M. incognita pectate lyase 1 (pel1)AY422837 M. incognita cellulase (eng4)AY422836 M. incognita cellulase (eng3)AY422833 M. incognita putative esophageal gland cell secretory protein 40 (msp40)AY422832 M. incognita putative esophageal gland cell secretory protein 39 (msp39)AY422831 M. incognita putative esophageal gland cell secretory protein 38 (msp38)AY422830* M. incognita putative esophageal gland cell secretory protein 37 (msp37)AY422829 M. incognita putative esophageal gland cell secretory protein 36 (msp36)AY134435* M. incognita putative esophageal gland cell secretory protein 16 (msp16)AY142121 M. incognita putative esophageal gland cell secretory protein 31 (msp31)AY142120* M. incognita putative esophageal gland cell secretory protein 30 (msp30)AY142119 M. incognita putative esophageal gland cell secretory protein 35 (msp35)AY142118 M. incognita putative esophageal gland cell secretory protein 33 (msp33)AY142117 M. incognita putative esophageal gland cell secretory protein 34 (msp34)AY142116 M. incognita putative esophageal gland cell secretory protein 32 (msp32)AY135365 M. incognita putative esophageal gland cell secretory protein 29 (msp29)AY135364 M. incognita putative esophageal gland cell secretory protein 28 (msp28)AY135363 M. incognita putative esophageal gland cell secretory protein 27 (msp27)AY135362 M. incognita putative esophageal gland cell secretory protein 26 (msp26)AY134444 M. incognita putative esophageal gland cell secretory protein 25 (msp25)AY134443* M. incognita putative esophageal gland cell secretory protein 24 (msp24)AY134442 M. incognita putative esophageal gland cell secretory protein 23 (msp23)AY134441 M. incognita putative esophageal gland cell secretory protein 22 (msp22)AY134440 M. 
incognita putative esophageal gland cell secretory protein 21 (msp21)AY134439 M. incognita putative esophageal gland cell secretory protein 20 (msp20)AY134438 M. incognita putative esophageal gland cell secretory protein 19 (msp19)AY134437 M. incognita putative esophageal gland cell secretory protein 18 (msp18)AY134436 M. incognita putative esophageal gland cell secretory protein 17 (msp17)AY134434 M. incognita putative esophageal gland cell secretory protein 15 (msp15)AY134433 M. incognita putative esophageal gland cell secretory protein 14 (msp14)AY134432 M. incognita putative esophageal gland cell secretory protein 13 (msp13)AY134431 M. incognita putative esophageal gland cell secretory protein 12 (msp12)AF531170 M. incognita putative esophageal gland cell secretory protein 10 (msp10)AF531169 M. incognita putative esophageal gland cell secretory protein 9 (msp9)AF531168* M. incognita putative esophageal gland cell secretory protein 8 (msp8)AF531167 M. incognita putative esophageal gland cell secretory protein 11 (msp11)AF531166 M. incognita putative esophageal gland cell secretory protein 7 (msp7)AF531165 M. incognita putative esophageal gland cell secretory protein 6 (msp6)AF531164 M. incognita putative esophageal gland cell secretory protein 5 (msp5)AF531163 M. incognita putative esophageal gland cell secretory protein 4 (msp4)AF531162 M. incognita putative esophageal gland cell secretory protein 3 (msp3)AF531161 M. incognita putative esophageal gland cell secretory protein 2 (msp2)AF531160 M. incognita putative esophageal gland cell secretory protein 1 (msp1)AJ311902 G. rostochiensis EXPB2 proteinAY288520 H. schachtii ubiquitin extension protein 2AY286305 H. schachtii ubiquitin extension protein (Ubi1)AF502391* H. glycines putative gland protein G10A06AF500024 H. glycines putative gland protein G8H07AF469060 H. glycines ubiquitin extension proteinAF469059 H. glycines annexin 4C10AF469058* H. glycines cellulose binding proteinAF468679* H. 
glycines chitinaseAJ493677 G. rostochiensis secreted glutathione peroxidase (gpx1)AY098646 M. incognita polygalacturonaseAY026357 H. glycines pectinase precursorAF344862*. H. glycines putative salivary proline-rich glycoprotein precursor Hgg-15AF402771 M. incognita calreticulinAF402309 M. incognita 14–3-3 productAF402308 M. incognita myosin regulatory light chainAF374388 H. glycines vap-1


GenBank accession no. Description

AY033601  H. glycines venom allergen-like protein (vap-2)
AF323086  M. incognita beta-1,4-endoglucanase (eng-2)
AJ300178  G. pallida annexin 2 (nex-2 gene)
AF273728*  H. glycines hypothetical esophageal gland cell secretory protein 1 (hsp1)
AJ251758*  G. rostochiensis hypothetical protein (clone A41)
AJ251757*  G. rostochiensis hypothetical protein (clone A18)
AF273735  H. glycines hypothetical esophageal gland cell secretory protein 8 (hsp8)
AF224342  M. incognita xylanase (xyl-1)
AJ271910*  G. rostochiensis putative hypodermis secreted protein (sxp1 gene)
AJ270995*  G. rostochiensis putative amphid protein (ams1 gene)
AF100549  M. incognita beta-1,4-endoglucanase (eng-1)
AJ243736  G. rostochiensis peroxiredoxin (tpx gene)
Y09293  G. pallida putative SEC-2 protein
AF095949  M. javanica chorismate mutase (NC30)
AF013289  M. incognita secreted protein MSP-1 (msp-1)
AF049139  M. incognita cellulose binding protein precursor cbp-1

*Gene not identified.


Table S4. M. hapla orthologs to C. elegans daf genes

C. elegans gene  Chromosome  Mh contig  E value  Protein

daf-8  I  87  -08  Smad
daf-16  I  353  -36  forkhead TF

daf-5  II  —  —  chemo sens. Proline-rich
daf-19  II  1476  -17  RFX TF

daf-2  III  133  -35  Ins/IGF orth
daf-4  III  1302  -26  TGF-β rec
daf-7  III  34  -06  BMP/TGF-β

daf-1  IV  523  -23  TGF-β rec
daf-10  IV  1431  -34  WD-WAA rep
daf-14  IV  871  -04  Smad
daf-15  IV  1249  -34  RAPTOR orth
daf-18  IV  1095  -34  lipid phosphatase

daf-11  V  1413  -41  guanylate cyclase
daf-21  V  1972  -149  HSP-90
daf-28  V  —  —  β-insulin
daf-36  V  —  —  Rieske oxygenase, hormone pathway

daf-3  X  1148  -15  co-Smad
daf-6  X  147  -73  amphid morph
daf-9  X  1759  -27  cytP450
daf-12  X  1276  -14  NHR ZF-rec


Table S5. Operon conservation by C. elegans chromosome: Number of operons found on each of the C. elegans chromosomes and number of each of those found at least partially conserved in M. hapla

C. elegans chromosome  C. elegans operons  No. conserved in M. hapla

I  259  24
II  214  26
III  270  18
IV  202  17
V  167  14
X  58  2
Totals  1,170  101
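
As a worked illustration of the arithmetic behind Table S5, the following Python sketch recomputes the column totals and the fraction of C. elegans operons on each chromosome that are at least partially conserved in M. hapla. The counts are transcribed from the table. The names TABLE_S5, conserved_fraction, and column_totals are hypothetical, chosen only for this example, and the sketch is not part of the original analysis.

```python
# Counts transcribed from Table S5:
# chromosome -> (C. elegans operons, operons at least partially conserved in M. hapla)
TABLE_S5 = {
    "I": (259, 24),
    "II": (214, 26),
    "III": (270, 18),
    "IV": (202, 17),
    "V": (167, 14),
    "X": (58, 2),
}


def conserved_fraction(table):
    """Return chromosome -> fraction of its operons that are conserved."""
    return {chrom: conserved / total for chrom, (total, conserved) in table.items()}


def column_totals(table):
    """Return (total operons, total conserved operons) summed over chromosomes."""
    total_operons = sum(total for total, _ in table.values())
    total_conserved = sum(conserved for _, conserved in table.values())
    return total_operons, total_conserved


if __name__ == "__main__":
    for chrom, frac in conserved_fraction(TABLE_S5).items():
        print(f"{chrom}: {frac:.1%}")
    operons, conserved = column_totals(TABLE_S5)
    print(f"Totals: {operons} operons, {conserved} conserved")
```

The summed columns reproduce the Totals row of the table (1,170 operons, 101 conserved), which corresponds to roughly 8.6% of C. elegans operons overall.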


Table S6. Operon conservation by C. elegans chromosome: List of C. elegans operons at least partially conserved in M. hapla

Operon  C. elegans location  No. of genes  M. hapla together  M. hapla total

Chromosome I

CEOP1934  I:534461..542202  3  2  2
CEOP1017  I:1176475..1258228  4  2  3
CEOP1032  I:2070611..2069084  2  2  2
CEOP1920  I:2485383..2468789  2  2  2
CEOP1108  I:3759075..3748537  2  2  2
CEOP1116  I:3824949..3828133  2  2  2
CEOP1256  I:5509172..5505437  2  2  2
CEOP1284  I:5869957..5857552  4  2  2
CEOP1296  I:6078849..6053344  6  3  4
CEOP1356  I:6915046..6917557  2  2  2
CEOP1392  I:7304916..7308625  2  2  2
CEOP1532  I:9110910..9100845  2  2  2
CEOP1542  I:9249375..9232063  3  2  2
CEOP1552  I:9619752..9611808  4  2/2  4
CEOP1568  I:9938290..9934875  2  2  2
CEOP1584  I:10026630..10050691  4  2  3
CEOP1628  I:10622941..10628376  3  3  3
CEOP1644  I:10734011..10740579  3  2  2
CEOP1660  I:11923898..11911995  3  2  3
CEOP1676  I:12741216..12730012  2  2  2
CEOP1692  I:13339385..13343923  2  2  2
CEOP1904  I:14406015..14433728  3  2  2
CEOP1744  I:14647769..14661401  3  3  3
CEOP1776  I:15052407..15040414  3  2  3

Chromosome II

CEOP2012  II:97942..88565  3  2  2
CEOP2056  II:1922868..1925698  2  2  2
CEOP2084  II:3076941..3062537  2  2  2
CEOP2120  II:4299651..4305270  2  2  2
CEOP2124  II:4323254..4334479  4  2  3
CEOP2132  II:4706801..4713678  2  2  2
CEOP2236  II:6123764..6127894  2  2  2
CEOP2284  II:7073728..7077120  2  2  2
CEOP2038  II:7708630..7701865  3  2  2
CEOP2336  II:8062666..8068107  2  2  2
CEOP2348  II:8457742..8470069  4  2  3
CEOP2416  II:9177065..9165842  3  2  2
CEOP2424  II:9324790..9341302  5  2  4
CEOP2428  II:9436023..9431300  2  2  2
CEOP2440  II:9647804..9650807  2  2  2
CEOP2452  II:9745982..9738897  2  2  2
CEOP2696  II:10039855..10030811  2  2  2
CEOP2496  II:10579281..10593445  7  2  2
CEOP2532  II:11348340..11351134  2  2  2
CEOP2540  II:11473471..11468461  3  3  3
CEOP2584  II:11980334..11997909  3  2  3
CEOP2588  II:12039049..12028671  2  2  2
CEOP2688  II:12202285..12189785  2  2  2
CEOP2616  II:13574962..13522820  5  2  3
CEOP2628  II:14145455..14123862  5  3  3
CEOP2636  II:14346052..14339963  2  2  2

Chromosome III

CEOP3020  III:235841..264395  2  2  2
CEOP3023  III:348608..335955  2  2  2
CEOP3062  III:1532758..1509397  2  2  2
CEOP3086  III:2819272..2838409  4  2  4
CEOP3164  III:4269242..4242644  5  3  5
CEOP3212  III:4704368..4706913  2  2  2
CEOP3228  III:4878626..4881460  2  2  2
CEOP3240  III:4930167..4914059  5  2  2
CEOP3268  III:5228374..5239987  4  2  2


CEOP3272  III:5254429..5267102  5  4  4
CEOP3388  III:6454425..6462163  2  2  2
CEOP3424  III:7234131..7221670  6  2  4
CEOP3508  III:8331838..8326497  2  2  2
CEOP3516  III:8372174..8387418  4  2  3
CEOP3708  III:10931090..10951051  3  2  2
CEOP3843  III:11290581..11276174  3  2  3
CEOP3740  III:11920123..11917716  2  2  2
CEOP3744  III:11920597..11929260  3  2  2

Chromosome IV

CEOP4020  IV:393880..399084  2  2  2
CEOP4050  IV:1902322..1921616  2  2  2
CEOP4116  IV:4172095..4179518  4  2  2
CEOP4148  IV:4749773..4747892  2  2  2
CEOP4208  IV:7039564..7031544  3  2  2
CEOP4240  IV:7709714..7703872  3  2  2
CEOP4272  IV:8136436..8154827  5  2  5
CEOP4280  IV:8290255..8301081  6  2  2
CEOP4296  IV:8691495..8682471  3  2  3
CEOP4336  IV:9725226..9731054  2  2  2
CEOP4360  IV:10176650..10183910  3  3  3
CEOP4404  IV:11082225..11076068  3  2  2
CEOP4428  IV:11298221..11295119  2  2  2
CEOP4484  IV:11961505..11957518  2  2  2
CEOP4490  IV:12093442..12089218  2  2  2
CEOP4492  IV:12094342..12108361  3  2  2
CEOP4596  IV:17118716..17090587  6  1/2/2  5

Chromosome V

CEOP553  V:3196839..3193393  2  2  2
CEOP5092  V:5940007..5950974  2  2  2
CEOP5112  V:6434111..6437867  2  2  2
CEOP5124  V:6605360..6599191  2  2  2
CEOP5136  V:6991843..6986123  3  2  2
CEOP5176  V:8031704..8039486  2  2  2
CEOP5539  V:8269233..8274025  2  2  2
CEOP5260  V:10770296..10772807  2  2  2
CEOP5300  V:11764882..11769863  2  2  2
CEOP5526  V:12967961..12966006  2  2  2
CEOP5385  V:13333853..13329179  2  2  2
CEOP5400  V:13807218..13800970  2  2  2
CEOP5452  V:16023324..16027418  2  2  2
CEOP552  V:17980048..17954088  2  2  2

Chromosome X

CEOPX154  X:587483..563725  2  2  2
CEOPX036  X:3116587..3112074  3  2  2

Operon name, location, and number of genes are based on C. elegans data found on WormBase. "M. hapla together" indicates the number of genes from the C. elegans operon found in proximity to each other. "M. hapla total" is the total number of genes found from the operon in the M. hapla dataset. Numbers split by a "/" indicate more than one set of genes that are in proximity within the set, but not between sets (e.g., 2/2 indicates that for the four genes in operon CEOP1552, there are two sets of 2 conserved genes, but those 2 sets are not in proximity to each other).
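
The split notation described in the note above can be made concrete with a short Python sketch. It parses a "M. hapla together" entry into the sizes of its proximity sets and applies a sanity check: the set sizes should not sum to more than "M. hapla total", which in turn should not exceed the operon's gene count. That check is an assumption inferred from the column definitions, not a rule stated by the authors, and the names OperonRow, parse_together, and is_consistent are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OperonRow:
    operon: str                # C. elegans operon name, e.g. "CEOP1552"
    n_genes: int               # "No. of genes" column
    together: Tuple[int, ...]  # sizes of the proximity sets in M. hapla
    total: int                 # "M. hapla total" column


def parse_together(field: str) -> Tuple[int, ...]:
    """Split an entry such as '2' or '2 / 2' into a tuple of set sizes."""
    return tuple(int(part) for part in field.split("/"))


def is_consistent(row: OperonRow) -> bool:
    """Assumed sanity check: sum of set sizes <= total found <= genes in operon."""
    return sum(row.together) <= row.total <= row.n_genes


# Three rows transcribed from Table S6 for illustration.
EXAMPLE_ROWS = [
    OperonRow("CEOP1552", 4, parse_together("2 / 2"), 4),
    OperonRow("CEOP4596", 6, parse_together("1 / 2 / 2"), 5),
    OperonRow("CEOP1017", 4, parse_together("2"), 3),
]

if __name__ == "__main__":
    for row in EXAMPLE_ROWS:
        print(row.operon, row.together, is_consistent(row))
```

Storing the entry as a tuple keeps single-set rows and split rows in the same shape, so a row with one proximity set is simply a tuple of length one.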
