05/04/2005
Informatics Meeting
C. elegans – “Back To The Future”.
Paul Davis (aka Huey)
Overview
• C. elegans Gene Prediction
  • Past.
    • Overview of genome project.
    • 1st pass annotation.
  • Present.
    • Script based list generation.
    • Gene Refinement (Transcript Based).
    • Small peptides.
    • C. briggsae comparison.
    • Large external gene family analysis.
  • Future.
    • Un-annotated overlap between gene predictors.
    • Gene Family curation.
    • Multiple species comparison.
  • Summary.
Past
• Genome Project
  • C. elegans 1st multicellular organism genome published 1998.
  • 97 Mb of sequence made up of
    • 2527 cosmids,
    • 257 YACs,
    • 113 fosmids,
    • 44 PCR products.
  • 5 gaps closed by 2002.
  • Annotated to find 19,099 protein coding genes.
• 1st pass annotation: Genefinder (Phil Green WASHU).
  • Curators appraised gene predictions on a clone by clone basis as they were finished.
Genome View
Legend: Predicted, Partially Confirmed, Confirmed
Colour corresponds to strand, not confidence.
Stats for WS141
• Currently 22,436 gene predictions.
  • 11,169 “un-touched”
    • + good 1st pass annotation.
    • + re-annotated >50%.
  • 2,576 Confirmed status.
    • Unlikely to change.
  • 5,624 Partially Confirmed.
    • Potentially modified.
  • 2,969 Predicted.
    • Potentially removed or altered.
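A quick arithmetic check on the WS141 figures, offered as an observation from the numbers rather than something the slide states: the three status counts add up exactly to the “un-touched” total, and the remaining predictions make up just over half of the gene set, which fits the “re-annotated >50%” bullet. The variable names are illustrative only.

```python
# Figures copied from the WS141 slide; the interpretation is an assumption.
total_predictions = 22_436
untouched = 11_169
status_counts = {
    "Confirmed": 2_576,
    "Partially Confirmed": 5_624,
    "Predicted": 2_969,
}

status_sum = sum(status_counts.values())          # sum of the three status classes
reannotated = total_predictions - untouched       # predictions that are not "un-touched"
reannotated_fraction = reannotated / total_predictions

print(f"status classes sum to {status_sum:,}")
print(f"re-annotated: {reannotated:,} ({reannotated_fraction:.1%} of all predictions)")
```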
Present
(re)annotation of a genome
Painting by numbers
Painting the Forth Rail Bridge
(re)annotating a genome
• We adopted a ‘paint by numbers’ approach involving automated appraisal of all gene models on a regular basis.
• Generation of lists of genes/features to be checked by human annotators.
Cycle diagram labels: Appraise, Curate, Process and report, Release and synchronise
Script Based Targeted Annotation
• Create a number of curation lists
  • Confirmed introns not in gene models.
  • ESTs/mRNAs in introns.
  • Overlapping gene predictions.
  • Predictions overlapping known repeats.
  • Short genes <150bp.
  • Short introns <40bp.
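Three of the curation lists above can be sketched in a few lines of Python. This is a minimal illustration, not the WormBase build scripts: the `GeneModel` class, the toy gene names and coordinates, and the choices to measure “short genes” by summed exon length and to compare overlaps only on the same strand are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class GeneModel:
    """Hypothetical gene model: exons are sorted (start, end), 1-based inclusive."""
    name: str
    strand: str
    exons: list[tuple[int, int]]

    @property
    def span(self) -> tuple[int, int]:
        return self.exons[0][0], self.exons[-1][1]

    def coding_length(self) -> int:
        return sum(end - start + 1 for start, end in self.exons)

    def intron_lengths(self) -> list[int]:
        return [nxt[0] - prev[1] - 1 for prev, nxt in zip(self.exons, self.exons[1:])]


def curation_lists(models, min_gene_bp=150, min_intron_bp=40):
    """Flag short genes, short introns and same-strand overlapping predictions."""
    lists = {"short_genes": [], "short_introns": [], "overlapping": []}
    for m in models:
        if m.coding_length() < min_gene_bp:
            lists["short_genes"].append(m.name)
        if any(length < min_intron_bp for length in m.intron_lengths()):
            lists["short_introns"].append(m.name)
    for a, b in combinations(models, 2):
        if a.strand == b.strand and a.span[0] <= b.span[1] and b.span[0] <= a.span[1]:
            lists["overlapping"].append((a.name, b.name))
    return lists


# Toy data, invented for illustration only.
models = [
    GeneModel("toy.1", "+", [(100, 160), (190, 240)]),      # 112 bp coding, 29 bp intron
    GeneModel("toy.2", "+", [(230, 500)]),                  # overlaps toy.1
    GeneModel("toy.3", "-", [(1000, 1200), (1300, 1500)]),  # passes every check
]
lists = curation_lists(models)
for list_name, entries in lists.items():
    print(list_name, entries)
```

A flagged entry is only a prompt for a human annotator to look, which matches the appraise-then-curate cycle described earlier; a later slide shows a prediction under 150 bp that turned out to be real.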
Transcript Based Refinements
• Automatic import of transcript data during our build cycle.
  • C. elegans mRNAs/cDNAs.
  • C. elegans ESTs.
  • Nematode ESTs.
• Processed and aligned to genome.
  • This produces data for our curation lists.
Gene Refinement Fmap View
• EST data points to 5’ extension and 3’ extension.
• Identified due to confirmed introns not in a gene model.
Figure labels: 5’, 3’, Transcript Data, Refined Prediction, Old prediction, Confirmed intron.
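The “confirmed introns not in a gene model” check amounts to a set difference between transcript-supported introns and the introns implied by the current predictions. The sketch below assumes introns are already available as `(strand, (start, end))` tuples in 1-based inclusive coordinates; the function names and the toy coordinates are hypothetical and are not taken from the WormBase pipeline.

```python
def introns_of(exons):
    """Introns implied by a sorted list of (start, end) exons, 1-based inclusive."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


def unexplained_introns(confirmed, models):
    """Return confirmed introns that no gene model contains.

    confirmed: list of (strand, (start, end)) taken from transcript alignments.
    models:    list of (strand, exons) for the current gene predictions.
    """
    annotated = {
        (strand, intron) for strand, exons in models for intron in introns_of(exons)
    }
    return [c for c in confirmed if c not in annotated]


# Toy example: one two-exon model, two transcript-confirmed introns.
models = [("+", [(100, 200), (300, 400)])]
confirmed = [("+", (201, 299)), ("+", (401, 450))]
to_check = unexplained_introns(confirmed, models)
print(to_check)
```

In this toy case the second intron starts right after the last annotated exon, the kind of evidence that would point a curator towards a 3’ extension.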
Not all <150bp Predictions are Bad?
• Small peptides can be real.
• H12D21.1 is a 34 aa peptide that appeared on a curation list.
  • Investigated.
  • Prediction had peptide similarity to 2 other elegans proteins.
  • Multi sequence alignment proved interesting.
H12D21.1 + Homols Fmap View & M.S.A.
Figure labels: SignalP cleavage site, Gene Prediction, Protein Homology Blocks
New Family Members
• Used tBlastn to identify other regions in the genome.
• Annotated these ORFs to give
  • 9 additional family members.
• These have been called nspa-1 to nspa-12
  • Nematode Specific Peptide family A.
Figure labels: Pseudogene, Expanded Family
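To illustrate why a translated search can find unannotated family members, here is a toy six-frame scan. It is explicitly not tBlastn: it looks only for exact peptide matches, with no scoring matrix, gaps or statistics, and the sequences are invented. It shows the underlying idea of translating genomic DNA in all six frames and looking for a known peptide in each.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_hits(genome, peptide):
    """Return (strand, frame, protein_offset) for every exact peptide match."""
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                hits.append((strand, frame, pos))
                pos = protein.find(peptide, pos + 1)
    return hits


# Invented toy sequence: a tiny ORF placed in forward frame 2.
genome = "CC" + "ATGAAATGGTAA" + "G"
hits = six_frame_hits(genome, "MKW")
print(hits)
```

For real work the actual BLAST programs should be used; this sketch only makes the six-frame idea concrete.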
C. briggsae Comparison
• C. elegans vs C. briggsae
  • C. briggsae hybrid gene set analysis (Avril Coghlan).
  • Detailed in PLoS Biol 2003 1:166-192 “The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.”
• WormBase has worked to incorporate the ~1300 new genes reported.
Coding Gene Predictions Over Time.
Increase in CDS due to 1st round of new genes identified by comparison with briggsae.
[Chart: x-axis Release Number, WS21 to WS124; y-axis scale 17,500 to 22,500; series “Predictions Including Isoforms” and “Coding Genes”; annotation “briggsae hybrid gene set”.]
Large family analysis
• Worm Community Members.
  • Multi Sequence Alignments of some large families.
    • 7 TM receptor families
      • 1700 family members.
      • Sub families have been worked on by multiple worm community members.
        • Hugh Robertson (University of Illinois)
        • Jim Thomas (University of Washington Seattle)
        • Jack Chen (CSH Laboratories)
Future
• Identify new avenues for gene refinement and identification.
  • Looking at predictor overlaps
    • (Genefinder/Twinscan overlaps) vs (WormBase Gene set)
  • In house protein family analysis
  • Multiple species comparisons
Predictor Overlaps.
Figure labels: Genefinder Prediction, Twinscan Prediction, New CDS Prediction, Strong Splicing, Good briggsae DNA::DNA Alignment
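The predictor-overlap idea, finding places where Genefinder and Twinscan agree but the curated gene set has nothing, can be sketched as simple interval logic. This is a minimal illustration under stated assumptions: predictions are reduced to `(start, end)` spans on one strand of one sequence, the function names are hypothetical, and the supporting evidence shown on the slide (splice site strength, briggsae DNA alignment) is not modelled.

```python
def overlaps(a, b):
    """True if two 1-based inclusive (start, end) intervals share any base."""
    return a[0] <= b[1] and b[0] <= a[1]


def unannotated_agreements(genefinder, twinscan, curated):
    """Regions predicted by both programs that overlap no curated gene."""
    candidates = []
    for g in genefinder:
        for t in twinscan:
            if overlaps(g, t):
                shared = (max(g[0], t[0]), min(g[1], t[1]))
                if not any(overlaps(shared, c) for c in curated):
                    candidates.append(shared)
    return candidates


# Toy spans, invented for illustration.
genefinder = [(100, 500), (2000, 2600)]
twinscan = [(150, 480), (2100, 2700)]
curated = [(90, 520)]
candidates = unannotated_agreements(genefinder, twinscan, curated)
print(candidates)
```

Each candidate region would still need a curator to judge the splicing and conservation evidence before a new CDS is created.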
Gene Family Analysis
• Protein alignments of multiple family members can refine gene predictions.
  • ClustalW
  • blast
• Main problems identified
  • Incorrect splicing
  • Truncations
  • Invalid extensions
Example of a Small Family Analysis.
• Problematic alignment
  • F56H6.9 appears to have 18aa extra sequence.
  • E03H4.4 seems to be lacking sequence.
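Problems like the two above tend to show up in a multiple alignment as columns where one family member disagrees with all the others. A minimal sketch, assuming the alignment is already available as equal-length gapped strings (for example, output from an aligner such as ClustalW), could count such columns per member. The member names and sequences below are invented and are not the real F56H6.9 or E03H4.4 proteins.

```python
def private_columns(alignment):
    """Count, per member, columns where it alone has a residue ('extra')
    and columns where it alone has a gap ('missing').

    alignment: dict of name -> gapped sequence, all the same length,
    with at least three members.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    extra = {name: 0 for name in names}
    missing = {name: 0 for name in names}
    for col in range(length):
        with_residue = [n for n in names if alignment[n][col] != "-"]
        if len(with_residue) == 1:
            extra[with_residue[0]] += 1          # possible invalid extension or insertion
        elif len(with_residue) == len(names) - 1:
            (absent,) = set(names) - set(with_residue)
            missing[absent] += 1                 # possible truncation or missed exon
    return extra, missing


# Toy alignment, invented for illustration.
alignment = {
    "memberA": "MKT----LLVAG",
    "memberB": "MKTQQQQLLVAG",
    "memberC": "MKT----LL---",
}
extra, missing = private_columns(alignment)
print("extra:", extra)
print("missing:", missing)
```

A high “extra” count is a hint of an invalid extension or mis-spliced exon, and a high “missing” count is a hint of a truncated model; either way the gene structure still has to be inspected by hand.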
Example of Problem.
• Problematic alignment
• Alignment following annotation.
Multiple Species Comparison.
• More nematode genomes are on their way
  • C. remanei
    • shotgun in progress
    • Blast server available http://genome.wustl.edu/projects/cremanei/
  • PB2801
    • shotgun in progress
  • C. japonica
    • shotgun in progress
elegans/briggsae/remanei Alignment for nspa-like peptides.
Summary
• Gene (Re)annotation >7 years.
  • New genes are still being discovered.
  • Primarily transcript driven.
  • More work on protein families.
• New strategies for gene prediction and refinement.
  • Using multiple gene predictors
  • Multi species comparison
Acknowledgements
• Genome Sequencing Center St. Louis
  • Sequencing and finishing teams etc.
  • WormBase team: Tamberlyn Bieri, Darin Blasiar, Phil Ozersky, John Spieth
• Wellcome Trust Sanger Institute
  • Sequencing and finishing teams etc.
  • WormBase team: Richard Durbin, Anthony Rogers, Dan Lawson, Mary Ann Tuli
• AceDB: Ed Griffiths, Roy Storey