
05/04/2005 Informatics Meeting C. elegans – “Back To The Future”. Paul Davis (aka Huey)


05/04/2005

Informatics Meeting

C. elegans – “Back To The Future”.

Paul Davis (aka Huey)

Overview

- C. elegans Gene Prediction
- Past.
  - Overview of genome project.
  - 1st Pass annotation
- Present.
  - Script based list generation.
  - Gene Refinement (Transcript Based).
  - Small peptides.
  - C. briggsae comparison.
  - Large external gene family analysis.
- Future.
  - Un-annotated Overlap between gene predictors
  - Gene Family curation.
  - Multiple species comparison.
- Summary.

Past

- Genome Project
  - C. elegans 1st multicellular organism genome published 1998.
  - 97-Mb of sequence made up of
    - 2527 cosmids,
    - 257 YACs,
    - 113 fosmids,
    - 44 PCR products.
  - 5 gaps closed by 2002.
  - Annotated to find 19,099 protein coding genes.
- 1st pass annotation Genefinder (Phil Green WASHU).
  - Curators appraised gene predictions on a clone by clone basis as they were finished.

Genome View

[Figure: genome view of gene models. Legend: Predicted; Partially Confirmed; Confirmed]

Colour corresponds to strand not confidence.

Stats for WS141

- Currently 22,436 gene predictions.
- 11,169 “un-touched”
  - good 1st pass annotation.
  - re-annotated >50%.
- 2,576 Confirmed status.
  - Unlikely to change.
- 5,624 Partially Confirmed.
  - Potentially modified.
- 2,969 Predicted.
  - Potentially removed or altered.
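
Note: the three status counts above add up to exactly the “un-touched” figure, which suggests (this is an inference from the numbers, not something the slide states) that Confirmed / Partially Confirmed / Predicted is a breakdown of the un-touched set. A quick arithmetic check, using only the counts from the slide:

```python
# Counts copied from the WS141 slide.
total_predictions = 22_436
untouched = 11_169
by_status = {"Confirmed": 2_576, "Partially Confirmed": 5_624, "Predicted": 2_969}

# Do the three status classes account for the un-touched predictions?
status_sum = sum(by_status.values())

# Everything that is not un-touched is taken here to be re-annotated (assumption).
reannotated = total_predictions - untouched
reannotated_fraction = reannotated / total_predictions

print(status_sum == untouched)                      # True
print(reannotated, f"{reannotated_fraction:.1%}")   # 11267 50.2%
```

The second line is consistent with the “re-annotated >50%” bullet.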

Present: (re)annotation of a genome

Painting by numbers

Painting the Forth Rail Bridge

(re)annotating a genome

- We adopted a ‘paint by numbers’ approach involving automated appraisal of all gene models on a regular basis.
- Generation of lists of genes/features to be checked by human annotators.

[Figure: curation cycle. Labels: Appraise; Curate; Process and report; Release and synchronise]

Script Based Targeted Annotation

- Create a number of curation lists
  - Confirmed introns not in gene models
  - ESTs/mRNAs in introns.
  - Overlapping Gene predictions.
  - Predictions overlapping known repeats.
  - Short Genes <150bp
  - Short introns <40bp
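
Note: a minimal sketch of how three of the lists above (short genes, short introns, overlapping predictions) could be generated. This is not the WormBase build code. The `GeneModel` class, the toy gene names, measuring “short gene” on summed exon length, and restricting overlaps to the same strand are all assumptions made for illustration; only the 150 bp and 40 bp thresholds come from the slide.

```python
from dataclasses import dataclass

SHORT_GENE_BP = 150    # threshold from the slide
SHORT_INTRON_BP = 40   # threshold from the slide


@dataclass
class GeneModel:
    """Hypothetical gene model: exons are sorted (start, end) pairs, 1-based inclusive."""
    name: str
    strand: str
    exons: list

    @property
    def coding_length(self):
        return sum(end - start + 1 for start, end in self.exons)

    @property
    def introns(self):
        # An intron is the gap between two consecutive exons.
        return [(a_end + 1, b_start - 1)
                for (_, a_end), (b_start, _) in zip(self.exons, self.exons[1:])]

    @property
    def span(self):
        return self.exons[0][0], self.exons[-1][1]


def build_curation_lists(models):
    lists = {"short_gene": [], "short_intron": [], "overlapping": []}
    for m in models:
        if m.coding_length < SHORT_GENE_BP:
            lists["short_gene"].append(m.name)
        if any(end - start + 1 < SHORT_INTRON_BP for start, end in m.introns):
            lists["short_intron"].append(m.name)
    # Sort by start so the inner loop can stop at the first non-overlapping model.
    ordered = sorted(models, key=lambda m: m.span[0])
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.span[0] > a.span[1]:
                break
            if a.strand == b.strand:
                lists["overlapping"].append((a.name, b.name))
    return lists


# Toy data, not real predictions.
models = [
    GeneModel("toy.1", "+", [(100, 160), (200, 240)]),
    GeneModel("toy.2", "+", [(230, 500)]),
    GeneModel("toy.3", "-", [(1000, 1300), (1400, 1600)]),
]
lists = build_curation_lists(models)
print(lists)
```

Here `toy.1` lands on both the short-gene and short-intron lists, and the pair (`toy.1`, `toy.2`) on the overlap list; each entry is then a candidate for a human curator to inspect, not an automatic correction.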

Transcript Based Refinements

- Automatic import of transcript data during our build cycle.
  - C. elegans mRNAs/cDNAs.
  - C. elegans ESTs.
  - Nematode ESTs.
- Processed and aligned to genome.
  - This produces data for our curation lists

Gene Refinement Fmap View

- EST data points to 5’ extension and 3’ extension.
- Identified due to confirmed introns not in a gene model

[Figure: Fmap view. Labels: 5’; 3’; Transcript Data; Refined Prediction; Old prediction; Confirmed intron]
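
Note: the “confirmed introns not in a gene model” check can be pictured as a set difference between introns supported by transcript alignments and introns implied by the current gene models. The sketch below is an illustration under simplifying assumptions (one sequence, one strand, introns compared by exact coordinates); the function name and the toy coordinates are hypothetical.

```python
def unexplained_introns(confirmed_introns, gene_models):
    """Return confirmed introns that no gene model uses.

    confirmed_introns: iterable of (start, end) pairs from transcript alignments.
    gene_models: dict mapping a model name to its sorted list of (start, end) exons.
    """
    modelled = set()
    for exons in gene_models.values():
        # The intron between two consecutive exons.
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            modelled.add((left_end + 1, right_start - 1))
    return sorted(set(confirmed_introns) - modelled)


# Toy data: one two-exon model and two transcript-confirmed introns.
gene_models = {"toy.1": [(100, 200), (300, 400)]}
confirmed = [(201, 299), (401, 450)]

extra = unexplained_introns(confirmed, gene_models)
print(extra)  # [(401, 450)]
```

In this toy case the intron starting right after the last exon hints at a missing downstream exon, the kind of 3’ extension described on the slide.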

Not all <150bp Predictions are Bad?

- Small peptides can be real.
- H12D21.1 is a 34 aa peptide that appeared on curation list.
  - Investigated.
  - Prediction had peptide similarity to 2 other elegans proteins.
  - Multi sequence alignment proved interesting.

H12D21.1 + Homols Fmap View & M.S.A.

[Figure: Fmap view and multiple sequence alignment. Labels: SignalP cleavage site; Gene Prediction; Protein Homology Blocks]

New Family Members

- Used tBlastn to identify other regions in genome.
- Annotated these ORFs to give
  - 9 additional family members
- These have been called nspa-1 to nspa-12
  - Nematode Specific Peptide family A

[Figure labels: Pseudogene; Expanded Family]
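
Note: tBlastn compares a protein query against a nucleotide database translated in all six reading frames, which is why it can find peptide-coding regions that have no gene model yet. The sketch below shows only the six-frame translation step with the standard genetic code; it is not a similarity search and not a substitute for tBlastn. Function names and the example sequence are illustrative.

```python
# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    # Codons containing anything other than A, C, G, T become "X".
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Translate a DNA string in frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame_translations("ATGAAATAG")
print(frames["+1"])  # MK*
print(frames["-1"])  # LFH
```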

C. briggsae Comparison

- C. elegans vs C. briggsae
  - C. briggsae hybrid gene set analysis (Avril Coghlan).
  - Detailed in PLoS Biol 2003 1:166-192 “The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics.”
  - WormBase has worked to incorporate the ~1300 new genes reported.

Coding Gene Predictions Over Time.

[Figure: chart of gene prediction counts by WormBase release. X-axis: Release (WS21 to WS124). Y-axis: Number (17500 to 22500). Series: Predictions Including Isoforms; Coding Genes. Annotation: briggsae hybrid gene set.]

Increase in CDS due to 1st round of new genes identified by comparison with briggsae.

Large family analysis

- Worm Community Members.
  - Multi Sequence Alignments of some large Families.
- 7 TM receptor families
  - 1700 family members
  - Sub families have been worked on by multiple worm community members.
    - Hugh Robertson (University of Illinois)
    - Jim Thomas (University of Washington Seattle)
    - Jack Chen (CSH Laboratories)

Future

- Identify new avenues for gene refinement and identification.
  - Looking at predictor overlaps
    - (Genefinder/Twinscan overlaps) vs (WormBase Gene set)
  - In house protein family analysis
  - Multiple species comparisons

Predictor Overlaps.

[Figure: Fmap view. Labels: Genefinder Prediction; Twinscan Prediction; New CDS Prediction; Strong Splicing; Good briggsae DNA::DNA Alignment]
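
Note: the “(Genefinder/Twinscan overlaps) vs (WormBase Gene set)” comparison can be sketched as an interval problem: find regions where both predictors call a gene and the curated set has nothing. The code below is an illustration only; the function names are hypothetical, the coordinates are toy values, and strand and exon structure are ignored for brevity.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def unannotated_consensus(genefinder, twinscan, curated):
    """Regions predicted by both programs that overlap no curated gene."""
    candidates = set()
    for g in genefinder:
        for t in twinscan:
            if not overlaps(g, t):
                continue
            # Keep only the part both predictors agree on.
            shared = (max(g[0], t[0]), min(g[1], t[1]))
            if not any(overlaps(shared, c) for c in curated):
                candidates.add(shared)
    return sorted(candidates)


# Toy spans, not real predictions.
genefinder = [(100, 500), (2000, 2600)]
twinscan = [(150, 480), (2100, 2700)]
curated = [(90, 510)]

new_candidates = unannotated_consensus(genefinder, twinscan, curated)
print(new_candidates)  # [(2100, 2600)]
```

A candidate found this way would still need the supporting evidence shown in the figure labels (splicing, briggsae alignment) before a curator creates a new CDS.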

Gene Family Analysis

- Protein alignments of multiple family members can refine gene predictions.
  - ClustalW
  - blast
- Main problems identified
  - Incorrect splicing
  - Truncations
  - Invalid extensions

Example of a Small Family Analysis.

- Problematic alignment
  - F56H6.9 appears to have 18aa extra sequence.
  - E03H4.4 seems to be lacking sequence.
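
Note: extra or missing sequence of the kind described above shows up in an alignment, but a cruder first screen is to compare each member's protein length with the family median. The sketch below is an assumption-laden illustration, not the method used in the talk: the function name, the 10 aa tolerance, and the member names and lengths are all made up.

```python
from statistics import median


def length_outliers(family_lengths, tolerance_aa=10):
    """Flag members much longer or shorter than the family median.

    family_lengths: dict mapping member name to protein length in aa.
    tolerance_aa: hypothetical cut-off; differences within it are ignored.
    """
    typical = median(family_lengths.values())
    report = {}
    for name, length in family_lengths.items():
        diff = length - typical
        if diff > tolerance_aa:
            report[name] = ("possible invalid extension", diff)
        elif diff < -tolerance_aa:
            report[name] = ("possible truncation", diff)
    return report


# Toy lengths, not taken from any real family.
family = {"member_a": 120, "member_b": 121, "member_c": 138,
          "member_d": 95, "member_e": 120}

flags = length_outliers(family)
print(flags)
```

Flagged members would still need to be checked against the alignment and the genomic evidence, since genuine family members can differ in length.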

Fmap View of F56H6.9

Example of Problem.

- Problematic alignment
- Alignment following annotation.

Multiple Species Comparison.

- More nematode genomes are on their way
  - C. remanei
    - shotgun in progress
    - Blast server available http://genome.wustl.edu/projects/cremanei/
  - PB2801
    - shotgun in progress
  - C. japonica
    - shotgun in progress

elegans/briggsae/remanei Alignment for nspa-like peptides.

Summary

- Gene (Re)annotation >7 years.
  - New genes are still being discovered.
  - Primarily Transcript driven.
- More work on protein families
- New strategies for gene prediction and refinement.
  - Using multiple gene predictors
  - Multi species comparison

Acknowledgements

- Genome Sequencing Center St. Louis
  - Sequencing and finishing teams etc.
  - WormBase team: Tamberlyn Bieri, Darin Blasiar, Phil Ozersky, John Spieth
- Wellcome Trust Sanger Institute
  - Sequencing and finishing teams etc.
  - WormBase team: Richard Durbin, Anthony Rogers, Dan Lawson, Mary Ann Tuli
  - AceDB: Ed Griffiths, Roy Storey

