Advancing Science with DNA Sequence
Finding the genes in microbial genomes
Natalia Ivanova
MGM Workshop
January 31, 2012
Outline
1. Introduction
2. Tools out there
3. Basic principles behind tools and known problems
4. Metagenomes
Finding the genes in microbial genomes: features
Sequence features in prokaryotic genomes:
• stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)
• protein-coding genes (CDSs)
• transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)
• translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)
• pseudogenes (tRNA and protein-coding genes)
• …
Well-annotated bacterial genome in Artemis genome viewer:
Outline
1. Introduction
2. Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site)
3. Known problems
4. Metagenomes
Publicly available genome annotation services
• IMG-ER: http://img.jgi.doe.gov/
• RAST: http://rast.nmpdr.org/
• JCVI Annotation Service: http://www.jcvi.org/cms/research/projects/annotation-service/
• RefSeq: http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html
What they provide and how they do it - RNAs
• Large structural RNAs (23S and 16S rRNAs)
BLASTn
RNAmmer: http://www.cbs.dtu.dk/services/RNAmmer/
• Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component)
Rfam database, INFERNAL search tool: http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/
tRNAscan-SE: http://lowelab.ucsc.edu/tRNAscan-SE/
What they provide and how they do it – protein-coding genes (CDSs - not ORFs!)
Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction).
Open reading frame (ORF): reading frame between a start and stop codon.
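The reading-frame definition can be illustrated with a minimal Python sketch. It assumes the standard genetic code and plain A/C/G/T input; the function names (`reverse_complement`, `translate`, `six_frame_translations`) are illustrative and not taken from any of the pipelines listed in this talk.

```python
# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Map (strand, offset) to the translation in that reading frame."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(template[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGAAATAG").items():
        print(frame, protein)
```

Every stretch between a start and a stop codon in any of these six translations is an ORF; only some of them are real CDSs, which is the distinction the slide title makes.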
Gene finders: ab initio tools; evidence-based refinement
Ab initio tools used by the pipelines:
• Glimmer family (Glimmer2, Glimmer3, RBS finder) -> NCBI, RAST, JCVI
http://glimmer.sourceforge.net/
• GeneMark family (GeneMark-hmm, GeneMarkS) -> NCBI
http://exon.gatech.edu/GeneMark/
• PRODIGAL -> IMG-ER, NCBI
http://compbio.ornl.gov/prodigal/
Evidence-based refinement: mostly undocumented in-house developed tools.
Types of corrections: missed genes (RAST, JCVI, NCBI), frameshifts (JCVI, NCBI), start sites (RAST)
Outline
1. Introduction
2. Tools out there
3. Basic principles behind tools
4. Metagenomes
What is an ab initio gene finder?
Two major approaches to prediction of protein-coding genes:
• ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs)
Advantages: finds “unique” genes; high sensitivity; very fast!
Limitations: often misses “unusual” genes; high rate of false positives
• “evidence-based” (ORFs with translations homologous to the known proteins are CDSs)
Advantages: finds “unusual” genes (e.g. horizontally transferred); relatively low rate of false positive predictions
Limitations: cannot find “unique” genes; low sensitivity on short genes; prone to propagation of false positive results of ab initio annotation tools; slow!
How ab initio tools work – very briefly
• Statistical model of coding and non-coding regions (codon or dicodon frequencies, hidden Markov models of different lengths)
• Statistical model architecture
• Additional algorithms for refinement of predictions (RBS finder, overlap resolution, etc.)
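The first bullet can be made concrete with a toy codon-usage scorer in Python. This is a deliberately simple stand-in for the dicodon and Markov models named above, not the model of any specific gene finder: it assumes a set of known coding sequences for training, add-one pseudocounts, and a uniform background, and the names `codon_frequencies` and `log_odds_score` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate codon frequencies from in-frame training sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if set(codon) <= set("ACGT"):  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def log_odds_score(seq, coding_freq, background_freq):
    """Sum of log(P_coding / P_background) over the codons of seq."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            score += math.log(coding_freq[codon] / background_freq[codon])
    return score


if __name__ == "__main__":
    # Toy training set (made up for illustration only).
    coding = codon_frequencies(["ATGGCTGCTGCTAAA"])
    uniform = {c: 1.0 / len(CODONS) for c in CODONS}
    print(log_odds_score("GCTGCTGCT", coding, uniform))  # codons seen in training
    print(log_odds_score("CCCCCCCCC", coding, uniform))  # codons never seen
```

A positive score means the ORF looks more like the training CDSs than like background. This also shows why such models miss “unusual” genes: a horizontally transferred gene with different codon usage scores poorly even when it is real.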
Prokaryotic gene model used by all ab initio gene finders:
• Ribosome-binding site within a certain distance of the start codon
• One of 3 start codons: ATG, GTG, TTG
• One of 3 stop codons: TAG, TAA, TGA
• No frame interruptions
[Figure: gene model showing the ribosome binding site, start codon, open reading frame and stop codon]
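The gene model above can be written down as a small forward-strand scanner in Python. This is a sketch under stated assumptions rather than how any real gene finder works: the RBS is approximated by Shine-Dalgarno-like motifs anywhere in a 20-nt upstream window (both the motif list and the window size are hypothetical choices), and the function name `find_candidate_genes` is illustrative.

```python
import re

START_CODONS = ("ATG", "GTG", "TTG")
STOP_CODONS = ("TAG", "TAA", "TGA")
# Hypothetical RBS approximation: fragments of the AGGAGG consensus.
RBS_MOTIF = re.compile("AGGAGG|GGAGG|AGGAG|GGAG|GAGG")


def find_candidate_genes(seq, rbs_window=20, min_codons=2):
    """Report start..stop stretches on the forward strand that fit the model."""
    seq = seq.upper()
    genes = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] not in START_CODONS:
            continue
        # Walk codon by codon; the first in-frame stop closes the frame.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    upstream = seq[max(0, start - rbs_window):start]
                    genes.append({
                        "start": start,      # 0-based, inclusive
                        "end": pos + 3,      # 0-based, exclusive
                        "has_rbs": bool(RBS_MOTIF.search(upstream)),
                    })
                break
    return genes


if __name__ == "__main__":
    print(find_candidate_genes("TTAGGAGGTTCACCATGAAACCCGGGTAATT"))
```

Each in-frame start codon is reported separately, so one stop codon can yield several candidates that differ only in their start site. A frameshift or an in-frame stop inside a real gene breaks the frame, and the scanner returns a truncated candidate or nothing.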
Known problems of all annotation pipelines
• RNAs
– Incomplete rRNAs
– Trans-spliced tRNA in archaeal genomes
– Small structural RNAs not predicted at all
• Protein-coding genes that don’t fit into prokaryotic gene model used by ab initio gene finders
– no RBS (leaderless transcripts)
– interrupted translation frame: sequencing errors or translational exceptions
– non-canonical start
[Figure: prokaryotic gene model (ribosome binding site; start codon: ATG, GTG, TTG; open reading frame; stop codon: TAG, TAA, TGA)]
Genome                              Sequencing center   16S rRNA, nt
Synechococcus sp. CC9311            UCSD, TIGR          1477
Synechococcus sp. CC9605            JGI                 1440
Synechococcus elongatus PCC 7942    JGI                 1490
Synechococcus sp. JA-2-3BA(2-13)    TIGR                1323
Synechococcus sp. JA-3-3Ab          TIGR                1324
Synechococcus sp. RCC307            Genoscope           1498
Synechococcus sp. WH7803            Genoscope           1497, 1464
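One way to screen for incomplete rRNAs is to compare annotated 16S lengths across related genomes. The Python sketch below uses the lengths from the table above; the 30-nt tolerance and the function name `flag_short_rrnas` are hypothetical, and comparing against the longest copy in the set is a crude heuristic, since true 16S lengths also differ somewhat between organisms.

```python
# Annotated 16S rRNA lengths (nt) from the table above.
SIXTEEN_S_LENGTHS = {
    "Synechococcus sp. CC9311": [1477],
    "Synechococcus sp. CC9605": [1440],
    "Synechococcus elongatus PCC 7942": [1490],
    "Synechococcus sp. JA-2-3BA(2-13)": [1323],
    "Synechococcus sp. JA-3-3Ab": [1324],
    "Synechococcus sp. RCC307": [1498],
    "Synechococcus sp. WH7803": [1497, 1464],
}


def flag_short_rrnas(lengths_by_genome, tolerance_nt=30):
    """List (genome, length, deficit) for copies much shorter than the longest."""
    longest = max(n for lengths in lengths_by_genome.values() for n in lengths)
    flagged = []
    for genome, lengths in lengths_by_genome.items():
        for n in lengths:
            deficit = longest - n
            if deficit > tolerance_nt:  # tolerance_nt is an assumed cutoff
                flagged.append((genome, n, deficit))
    return flagged


if __name__ == "__main__":
    for genome, length, deficit in flag_short_rrnas(SIXTEEN_S_LENGTHS):
        print(f"{genome}: {length} nt ({deficit} nt shorter than the longest copy)")
```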
Symptoms of gene finding problems
• Some type of mandatory feature (rRNAs, tRNAs, CDSs) is missing
• “Truncated” genes (shorter than homologs) => funky translation initiation features (non-canonical start codons, leaderless transcripts)
• Many “unique” genes without protein family assignment or BLASTp hit => sequencing errors (frameshifts)
• Undetected selenocysteines, programmed frameshifts in ~50 well-conserved protein families
Supplemental tools
TIS (translation initiation site) prediction/correction
• TICO: http://tico.gobics.de/
• TriTISA: http://mech.ctb.pku.edu.cn/protisa/TriTISA
The two tools often disagree about the best TIS, especially in high GC genomes
Operon prediction
• JPOP: http://csbl.bmb.uga.edu/downloads/#jpop
• http://www.cse.wustl.edu/~jbuhler/research/operons/
• http://www.sph.umich.edu/~qin/hmm/
Proteins with unusual translational features – selenocysteine-containing genes
• bSECISearch: http://genomics.unl.edu/bSECISearch/
Metagenomes sequenced with new technologies: low-coverage problems
• Both 454 and Illumina require high sequence coverage in order to achieve high sequence quality (25x to >100x)
• High sequence coverage cannot be achieved for metagenome data
How does this affect metagenome annotation?
• ~70% of 454 Titanium reads have at least 1 sequencing artifact (basecalls in homopolymeric runs); there is no clear pattern of error distribution
• >100 bp Illumina reads have ~3% error rate; the error rate is higher towards the end of the read, and the majority of errors are substitutions
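A back-of-the-envelope Python calculation shows what such error rates mean for a single uncorrected read. It assumes the ~3% figure is a per-base rate and that errors are independent and uniform along the read. The slide says errors are actually concentrated towards the read end, so this is only a rough illustration, and both function names are hypothetical.

```python
def expected_errors(per_base_error, length):
    """Expected number of erroneous bases in a read of the given length."""
    return per_base_error * length


def prob_error_free(per_base_error, length):
    """Probability that no base in the read is wrong (independent errors)."""
    return (1.0 - per_base_error) ** length


if __name__ == "__main__":
    p, read_length = 0.03, 100  # assumed per-base error rate and read length
    print(f"expected errors per read: {expected_errors(p, read_length):.1f}")
    print(f"P(error-free read): {prob_error_free(p, read_length):.3f}")
```

Under these assumptions only a few percent of reads are error-free. In an isolate genome, high coverage lets the consensus outvote such errors, but a low-coverage metagenome offers no such correction, so the errors reach the gene finder directly.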
[Figure: sequence coverage, metagenome vs. genome]
Just one example…
predicted gene
3 frameshifts
4-read contig, 1476 nt, no misassembly
• Contig has 27 homopolymers (3 nt and more), 3 of them have errors
• No correlation with homopolymer type or error type
• Reads were quality trimmed prior to assembly
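Counting homopolymers of 3 nt and more in a contig, as in this example, takes only a few lines of Python. This is a minimal sketch assuming plain A/C/G/T sequence; the function name `find_homopolymers` is illustrative. It only locates the runs; deciding which ones contain errors still requires inspecting the read alignments.

```python
import re


def find_homopolymers(seq, min_length=3):
    """Return (start, base, run_length) for each run of at least min_length."""
    # A base followed by at least (min_length - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(seq.upper())
    ]


if __name__ == "__main__":
    for start, base, length in find_homopolymers("ACGTTTACCCCGAAT"):
        print(f"{base} x {length} at position {start}")
```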
Metagenome annotation tools (more details will be given)
• GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs): http://exon.gatech.edu/GeneMark/
• MetaGene: http://metagene.cb.k.u-tokyo.ac.jp/metagene/
• FragGeneScan: http://omics.informatics.indiana.edu/FragGeneScan/
Full-service annotation pipelines
• IMG/M-ER – “metagenome gene calling” + other options: http://img.jgi.doe.gov/submit
• MG-RAST: http://metagenomics.nmpdr.org/
• CAMERA annotation pipeline: http://camera.calit2.net/