Gene prediction in flies
● Background
● Gene prediction pipeline
● Resources
Background
Genome quality
Species | Contigs | Size / Mb | Coverage | Number of gaps | Nucleotides in gaps / Mb
D. simulans | 17 | 142 | ~3-fold+6x-fold¹ | 23,640 | 15 (11)
D. sechellia | 14,730 | 167 | ~3-fold | 6,695 | 9 (6)
D. yakuba | 20 | 169 | ~6-fold | 13,502 | 6 (4)
D. erecta | 5,124 | 153 | ~12-fold | 2,486 | 8 (5)
D. ananassae | 13,749 | 231 | ~8-fold | 6,783 | 17 (7)
D. pseudoobscura | 4,896 | 153 | ~7-fold | 8,729 | 7 (4)
D. persimilis | 12,838 | 188 | ~4-fold | 13,975 | 13 (7)
D. willistoni | 14,927 | 237 | ~6-fold | 5,716 | 12 (5)
D. virilis | 13,530 | 206 | ~9-fold | 4,852 | 17 (8)
D. mojavensis | 6,841 | 194 | ~8-fold | 5,033 | 14 (7)
D. grimshawi | 17,440 | 200 | ~8-fold | 6,717 | 14 (7)
Genes in Drosophila melanogaster
● high gene density
● at least 20% with alternative transcripts
● can be nested
  on the same strand
  on different strands
● di-cistronic
● involve trans-splicing
  exons from a different strand
Gene prediction pipeline
● Gene prediction by homology
  no ab-initio predictions
  not using genomic alignments
● TBLASTN/Genewise process
  quick genome scan to find putative gene-containing regions
  aligning peptide sequence to genomic fragment using a gene model
  ● cds
  ● introns
  ● splice-sites
[Figure: pipeline flow chart. Box labels: Genome; Template transcripts (Representative transcripts, Alternative transcripts); Pre-processing; Genome scan; Regions; Gene prediction; Predicted representative transcripts; Predicted alternative transcripts; Gene assignment; Redundancy removal; Quality control; Gene predictions.]
Sensitivity – Selectivity - Speed
● Genome scan
  strict trade-off between sensitivity versus memory/time
● Transcript prediction
  t = O(MN)
  ● N: length of peptide sequence = quite short
  ● M: length of DNA sequence = large
  you want to minimize
  ● the length of the genomic sequence to search
  ● the number of fragments you align
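The O(MN) scaling above can be made concrete by counting dynamic-programming cells. This is only a rough sketch: the function names and the example lengths are hypothetical and are not taken from the pipeline.

```python
def alignment_cells(peptide_len, dna_len):
    """Rough work estimate for an O(M*N) peptide-to-DNA alignment."""
    return peptide_len * dna_len


def total_cells(peptide_len, fragment_lens):
    """Total work when one peptide is aligned against several fragments."""
    return sum(alignment_cells(peptide_len, m) for m in fragment_lens)


# Hypothetical numbers: a 400-residue peptide aligned against one 2 Mb
# scaffold, versus two 20 kb candidate regions found by a genome scan.
whole = total_cells(400, [2_000_000])
restricted = total_cells(400, [20_000, 20_000])
speedup = whole / restricted
```

With these made-up lengths, restricting the search to the two regions cuts the work by a factor of 50, which is why both the fragment length and the number of fragments matter.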
Solutions
● ENSEMBL: Minigenes
  cut out putative introns
● My pipeline: priority lists
  gene structure conservation
Difficulties
● Terminal exons
  short and thus alignment signal is weak
● Spindly genes
  there is no length penalty on introns
Concepts
● Predict in three passes
  1) Predict clear cut cases
  2) Predict dubious cases only if they don't overlap with a previous prediction
  3) Predict alternative transcripts
● Iteratively search for duplications
● Accept a prediction with conserved exon boundaries
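The first two passes can be sketched as a simple overlap filter. This is a minimal illustration, not the pipeline's code: the `Candidate` class, the `clear_cut` flag and the coordinates are all hypothetical, and the third pass (alternative transcripts) is left out.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    start: int
    end: int
    clear_cut: bool  # hypothetical flag: True for unambiguous predictions


def overlaps(a, b):
    """True if two half-open intervals [start, end) share any position."""
    return a.start < b.end and b.start < a.end


def two_pass_select(candidates):
    """Pass 1 keeps clear-cut cases; pass 2 adds dubious ones that fit."""
    accepted = [c for c in candidates if c.clear_cut]
    for c in candidates:
        if not c.clear_cut and not any(overlaps(c, a) for a in accepted):
            accepted.append(c)
    return accepted


cands = [
    Candidate("g1", 100, 500, True),
    Candidate("g2", 400, 900, False),    # overlaps g1, rejected
    Candidate("g3", 1000, 1500, False),  # free region, accepted
    Candidate("g4", 1400, 1600, False),  # overlaps g3, rejected
]
names = [c.name for c in two_pass_select(cands)]
```

Because accepted dubious cases are added to the same list, a later dubious case is also checked against them, matching "a previous prediction" in the second pass.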
Conservation of gene structure
[Figure: exon boundaries of query and prediction mapped on the query protein, one query/prediction pair per category: conserved, partially conserved, single exon, retrotransposed, unconserved.]
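One way to read the five categories is as a comparison of boundary positions on the query protein. The sketch below is an assumption about what each label means, not the rule set the pipeline uses; the function name, the `tolerance` parameter and the decision order are hypothetical. In particular, calling an intronless prediction of a multi-exon query "retrotransposed" is only a guess at the slide's intent.

```python
def classify_structure(query_bounds, pred_bounds, tolerance=0):
    """Compare exon boundaries given as residue positions on the query protein.

    An empty list means the gene has a single coding exon.
    """
    if not query_bounds and not pred_bounds:
        return "single exon"
    if query_bounds and not pred_bounds:
        # Assumed meaning: the prediction lost all introns of the query.
        return "retrotransposed"
    matched = sum(
        any(abs(q - p) <= tolerance for p in pred_bounds) for q in query_bounds
    )
    same_count = len(pred_bounds) == len(query_bounds)
    if query_bounds and matched == len(query_bounds) and same_count:
        return "conserved"
    if matched > 0:
        return "partially conserved"
    return "unconserved"


labels = [
    classify_structure([], []),
    classify_structure([50, 120], []),
    classify_structure([50, 120], [50, 120]),
    classify_structure([50, 120], [50, 200]),
    classify_structure([50, 120], [80, 200]),
]
```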
Quality control
● Classify predictions into categories
  Full length or fragment
  Gene or pseudogene
  Conserved or not conserved gene structure
● Heuristically remove predictions
  that are redundant
  that are in conflict
  ● nested genes
  ● good predictions take precedence over bad predictions
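The precedence rule can be sketched as a greedy filter that visits predictions from best to worst. The tuple layout, the numeric score, and the choice to tolerate fully nested genes while rejecting partial overlaps are all assumptions made for illustration; the actual heuristics are not described on the slide.

```python
def remove_conflicts(predictions):
    """Greedy filter over (name, start, end, score) tuples.

    Coordinates are half-open. A prediction is dropped if it partially
    overlaps an already kept one; full containment (nesting) is tolerated.
    """
    kept = []
    for name, start, end, score in sorted(predictions, key=lambda p: -p[3]):
        conflict = False
        for _, k_start, k_end, _ in kept:
            overlap = start < k_end and k_start < end
            nested = (k_start <= start and end <= k_end) or (
                start <= k_start and k_end <= end
            )
            if overlap and not nested:
                conflict = True
                break
        if not conflict:
            kept.append((name, start, end, score))
    return sorted(kept, key=lambda p: p[1])


preds = [
    ("a", 100, 1000, 0.9),
    ("b", 800, 1500, 0.4),   # partially overlaps the better "a", dropped
    ("c", 300, 400, 0.7),    # nested inside "a", kept
    ("d", 2000, 2500, 0.2),  # no overlap, kept
]
kept_names = [p[0] for p in remove_conflicts(preds)]
```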
Results
● http://wwwfgu.anat.ox.ac.uk:8080/cgi-bin/gbrowse
Number of predicted genes
[Figure: bar chart. x-axis: Genome (dmel, dsim, dsec, dyak, dere, dana, dpse, dper, dwil, dvir, dmoj, dgri); y-axis: Genes (0 to 16,000). Legend: Genes: conserved, semi-conserved, single exon; Pseudogenes.]
Orthology assignments
Genes in D. melanogaster with orthologs
[Figure: bar chart. x-axis: Genome (dsim, dsec, dyak, dere, dana, dpse, dper, dwil, dvir, dmoj, dgri); y-axis: Number of D. melanogaster genes (0 to 14,000).]
Technical details
● Hardware:
  28 dual CPU nodes with 2 GB memory
  Sun Grid Engine (SGE)
● Pipeline logic
  gmake
● Tasks
  Python scripts (and Perl scripts)
  Bash/awk scripts
● Database
  Postgres
Downstream analysis
● Pairwise orthology assignment
  PhyOP Pipeline (Leo Goodstadt (2006))
● Multiple orthology assignment
  My own concoction based on graph clustering with some consistency criteria
● Multiple alignment of cds
  Dialign (<50 sequences)
  Muscle (<500 sequences)
Phylogenetic analysis
● 14,000 GBlocks cleaned multiple alignments
● Calculation of ka and ks with PAML
● Phylogenetic trees
  Genome trees
  Gene trees built with Fitch/Kitsch
Odds and bits
● Mapping of Pdb -> Uniprot -> dmel proteins
● Mapping of Interpro domains onto predictions
  not up-to-date
● Codon bias analysis
  ENC, CAI, information theoretic measures
  GC3, GC3_4D
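Of the measures listed, GC3 is the simplest to state: the fraction of codons with G or C at the third position. The sketch below assumes an in-frame coding sequence and counts every complete codon, including a stop codon if present; whether the analysis here excluded stop codons is not stated. GC3_4D, the same measure restricted to fourfold degenerate sites, is not covered.

```python
def gc3(cds):
    """Fraction of complete codons whose third position is G or C."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    thirds = cds[2:usable:3]
    if not thirds:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "GC" for base in thirds) / len(thirds)


# Codons ATG GCC AAA TTC have third positions G, C, A, C.
example = gc3("ATGGCCAAATTC")
```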
Comparison of measures
[Figure: comparison of codon bias measures. Labels: Experimental CAI; Computational CAI; Ribosomal CAI; ENC; GC3; Encoding | bias; Encoding | unbiased; Encoding | uniform.]
Other groups
● see http://rana.lbl.gov/drosophila/wiki/index.php/Main_Page
● Gene predictions by others
  Don Gilbert: SNAP
  Lior Pachter: GeneMapper (genomic alignments)
  Eisen Lab: TBLastN + Genewise/Exonerate, GeneMapper
  Batzoglou Lab: CONTRAST
  Brent Lab: N-Scan
  Guigo: geneid and SGP2
  http://insects.eugenes.org/species/news/genome-summaries/genepredictions.html
Consensus predictions
● Gbrowse comparison of all gene predictions
  http://rana.lbl.gov/drosophila/gbrowse/cgi-bin/gbrowse
● Mike Eisen's group: GLEAN consensus set
● Don Gilbert: http://insects.eugenes.org/species/
● Other resources
  tRNA predictions
  genome alignments