Date post: 29-Dec-2015
Annotation of Anopheline Genomes at VectorBase Dan Lawson, VectorBase & The Anopheles Genomes Cluster Consortium EMBL-EBI
Page 1: Annotation of Anopheline Genomes at VectorBase Dan Lawson, VectorBase & The Anopheles Genomes Cluster Consortium EMBL-EBI.

Annotation of Anopheline Genomes at VectorBase

Dan Lawson, VectorBase & The Anopheles Genomes Cluster Consortium, EMBL-EBI

Anopheline species in this study: Current status

Genome sequencing

• 9 of 16 species assembled and annotated

RNAseq

• 10 of 12 species sequenced

Isolate re-sequencing

• 12 of 12 species sequenced

Genome annotation

• First-pass genome annotation is almost always based on “automatic” computational approaches

• ab initio

• Similarity based

• Transcript (ESTs, RNAseq)

• Protein (nr protein database)

Genome annotation - building a pipeline

• Genome assembly

• Map repeats

• Genefinding

• Protein-coding genes

• Map transcripts, map peptides

• nc-RNAs

• Functional annotation

• Submission to archival databases (release)
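One way to picture the pipeline is as a dependency graph that is resolved into a run order. The sketch below is only illustrative: the stage names come from the slide, but the dependencies between them are assumptions, since the arrows of the original diagram are not recoverable from the text. It uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Stage -> stages it is ASSUMED to depend on. The stage names follow the
# slide; the edges are a guess at the original flowchart.
PIPELINE = {
    "genome_assembly": set(),
    "map_repeats": {"genome_assembly"},
    "map_transcripts": {"map_repeats"},
    "map_peptides": {"map_repeats"},
    "genefinding": {"map_repeats"},
    "protein_coding_genes": {"genefinding", "map_transcripts", "map_peptides"},
    "ncrnas": {"map_repeats"},
    "functional_annotation": {"protein_coding_genes", "ncrnas"},
    "release": {"functional_annotation"},
}


def execution_order(pipeline):
    """Return one valid order in which the stages could be run."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for step, stage in enumerate(execution_order(PIPELINE), start=1):
        print(step, stage)
```

Stages with no dependency on each other (for example transcript and peptide mapping) may come out in either order, which is also where a real pipeline could run them in parallel.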

Automatic annotation strategies

• Similarity

• ab initio

Genome annotation: resources

• ab initio predictions using SNAP and Augustus

• Mixed whole animal RNAseq datasets generated using Illumina sequencing

• Assembled using Trinity (Broad Institute)

• Many dipteran proteomes (including 4 mosquitoes & D. melanogaster)

• All arthropod/metazoan proteomes

MAKER annotation with RNAseq and reference proteomes

• Aim:

• Gene prediction aggregation for the masses.

• Used for a number of arthropod genome projects

• Touted as the default pipeline for many more (part of the GMOD toolkit)

• Overview

• ab initio gene predictions from SNAP, Augustus & FGENESH

• Final gene models from MAKER

• Similarity alignments from both EXONERATE and BLAST

• Repeats from RepeatFinder & RepeatMasker

• Additional data sets integrated via GFF3 files (RNA-Seq)

• Uses MPI for parallelization over a compute farm

• Summary

• Iterative runs give acceptable reference gene sets.

• Used for Heliconius, Glossina, sandflies and the first tranche of the 16 Anophelines

Current VectorBase annotation pipeline

• MAKER based automatic annotation

• includes SNAP training and ab initio

• RNAseq based transcript similarity prediction

• Taxonomically constrained peptide similarity prediction

• 2 rounds of prediction refinement & final round includes all peptide similarity

• Community annotation phase

• Capture gene structure changes

• Metadata associated with locus (symbol, description, citation)

• Submission to INSDC, propagation to UniProt

• Presentation through VectorBase

Start

1.0 set (automatic)

1.1 set (published)

Projection from a reference annotation

Gene prediction based on projection from reference annotation

• Local alignment of An. gambiae CDS to the assemblies provides a platform for improving gene predictions.

• Example locus: Rps7 (AGAP008916)

• Potential for transcript based assembly improvement via seqedits of genome sequence
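To make "local alignment" concrete, here is a toy Smith-Waterman scorer. It is a textbook illustration only: the slides do not say which aligner was used for the projection, and the scoring parameters below are arbitrary assumptions. A production projection would rely on a dedicated, splice-aware tool rather than this.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


if __name__ == "__main__":
    # Hypothetical sequences: a short CDS fragment against a scaffold fragment.
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

The floor at zero is what makes the alignment local: a conserved CDS fragment can score well even when the flanking assembly sequence is unrelated.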

Annotation: Preliminary genesets

• 10,738 - 13,162 predictions

• no ncRNAs yet predicted

Preliminary comparative analysis

• OrthoMCL runs including 17 species

• An. gambiae PEST 12,810 protein-coding genes

• No. of clusters containing all 13 mosquitoes 4961 (≃ 39%)

• No. of clusters containing all 11 Anophelines 5463 (≃ 43%)

• No. of clusters containing 10 Anophelines (minus darlingi) 6606 (≃ 52%)

• No. of clusters containing 9 Anophelines (minus darlingi & christyi) 7477 (≃ 58%)

• No. of clusters containing representatives of the gambiae complex (ar/ga/qu) 9089 (≃ 71%)

• No. of clusters containing 8 Anophelines (minus darlingi & christyi) but not gambiae 600

An. darlingi

Glossina morsitans

Lutzomyia longipalpis

Phlebotomus papatasi
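Counts like these can be derived from an OrthoMCL groups file by asking, for each cluster, whether every required species is present. The sketch below assumes the usual `group_id: taxon|gene taxon|gene ...` line layout; the taxon codes in the example are hypothetical. The quoted percentages appear consistent with dividing by the 12,810 An. gambiae PEST protein-coding genes, but that denominator is an inference from the numbers, not something the slides state.

```python
def parse_groups(lines):
    """Map each group id to the set of taxon codes it contains."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        group_id, members = line.split(":", 1)
        groups[group_id.strip()] = {m.split("|", 1)[0] for m in members.split()}
    return groups


def count_clusters(groups, required, excluded=()):
    """Count clusters containing every required taxon and no excluded taxon."""
    required, excluded = set(required), set(excluded)
    return sum(
        1 for taxa in groups.values()
        if required <= taxa and not (excluded & taxa)
    )


def percent_of_reference(n_clusters, n_reference_genes=12810):
    """Cluster count as a rounded percentage of a reference gene count."""
    return round(100 * n_clusters / n_reference_genes)


if __name__ == "__main__":
    example = [
        "OG1: agam|A1 aara|B1 aqua|C1",
        "OG2: agam|A2 aara|B2",
        "OG3: aara|B3 aqua|C3 aqua|C4",
    ]
    groups = parse_groups(example)
    print(count_clusters(groups, {"agam", "aara", "aqua"}))
    print(count_clusters(groups, {"aara", "aqua"}, excluded={"agam"}))
    print(percent_of_reference(4961))
```

The `excluded` argument covers the "but not gambiae" style of query in the last bullet.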

All genomes deserve a home

• Genome browser

• Similarity searches

• BLAST/BLAT

• Query tools

• Simple keyword

• Complex queries

• Downloads

[Screenshots: Similarity searches, Query tool, Downloads, Browser, Compara]

VectorBase

• The long-term home for these genomes is VectorBase.

• NIAID-funded Bioinformatics Resource Center focused on arthropod vectors of human pathogens

• Ensembl genome browser

• Similarity searches

• File downloads

Anopheles Genomes Cluster wiki site

Thematic analysis groups & community annotation

• Community-led annotation of the genomes using the Community Annotation Portal (CAP)

Community annotation decision tree

Community annotation workflow

ARTEMIS APOLLO

scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
scf7180000638805 ptn2genome ptn_match 52 305 696 + . ID=xxxx3;Name=sp|Q91VD9|
scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|

scf7180000638805 ptn2genome ptn_match 52 605 892 + . ID=xxxx;Name=tr|Q3UIQ2|
scf7180000638805 ptn2genome ptn_match 78 205 960 + . ID=xxxx2;Name=tr|Q3TIU7|
scf7180000638805 ptn2genome ptn_match 78 205 950 + . ID=xxxx2;Name=tr|Q3VIU732|

>MY SUPERCONTIG
ATATATGCGTTGAGCTGCGTTACGTTCGGGATGCGTTAGGCTTGTGAGCTGGATCGGTCCTGCCTGCGTCGATATAAACGACCT…

Identify gene

Modify model

Submit

CAP

GFF3 FASTA
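Since submissions travel as GFF3 plus FASTA, it can help to see how one GFF3 feature line breaks down into its nine columns. The parser below is a minimal sketch, not the CAP's own code: real GFF3 is tab-delimited, and the whitespace fallback is an assumption added so that lines copied from the slide (where tabs were lost) still parse.

```python
def parse_gff3_line(line):
    """Parse one nine-column GFF3 feature line into a dictionary."""
    text = line.rstrip("\n")
    # GFF3 is tab-delimited; fall back to whitespace for text copied from slides.
    fields = text.split("\t") if "\t" in text else text.split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(
        pair.split("=", 1) for pair in attrs.strip().split(";") if "=" in pair
    )
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }


if __name__ == "__main__":
    example = ("scf7180000638805 ptn2genome ptn_match 52 605 892 + . "
               "ID=xxxx;Name=tr|Q3UIQ2|")
    print(parse_gff3_line(example))
```

A check like the column count above is the kind of thing a submission validator could report back to the submitter, though what the CAP actually validates is not described in the slides.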

CAP reporting

• Email report back to submitter to show status

• If successful, the model is stored in a local database and then presented to the genome browser via DAS

• Failed submissions come with (some) information on why they failed. Submitters then need to correct these errors and resubmit

CAP submissions displayed in the genome browser

• Similarity track for supporting evidence (from previous updates)

Genome annotation metrics

• Metrics for quality of a gene set are far from standardised but...

• Simple statistics (length, number of exons, intron size)

• Level of support from transcript data (how many genes have overlapping EST/RNAseq)

• Junction data (confirmation of introns)

• Comparison to public datasets (UniProt)

• Protein domains (InterPro)

• Comparative analysis - orthologs/paralogs
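The first two metrics are straightforward to compute from coordinates. The sketch below is an illustration under stated assumptions: exons and evidence are given as 1-based inclusive `(start, end)` pairs on a single sequence and strand, and "support" simply means any overlap with a transcript alignment. The function names and the example coordinates are hypothetical.

```python
def gene_model_stats(exons):
    """Simple statistics for one transcript given its exon coordinates."""
    exons = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in exons]
    # An intron is the gap between consecutive exons.
    intron_sizes = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]
    return {
        "n_exons": len(exons),
        "transcript_length": sum(exon_lengths),
        "intron_sizes": intron_sizes,
        "genomic_span": exons[-1][1] - exons[0][0] + 1,
    }


def fraction_with_support(genes, evidence):
    """Fraction of genes overlapped by at least one evidence interval.

    genes: {gene_id: (start, end)}; evidence: list of (start, end).
    """
    if not genes:
        return 0.0
    supported = sum(
        1 for g_start, g_end in genes.values()
        if any(e_start <= g_end and g_start <= e_end for e_start, e_end in evidence)
    )
    return supported / len(genes)


if __name__ == "__main__":
    print(gene_model_stats([(100, 199), (300, 349), (500, 599)]))
    print(fraction_with_support({"g1": (100, 600), "g2": (1000, 1500)}, [(550, 700)]))
```

The all-pairs overlap test is quadratic and only meant for small examples; a genome-scale comparison would normally sort the intervals or use an interval index.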

Still to do...

Primary annotation

• Still 7 genomes outstanding from the Broad Institute - de novo repeat finding and MAKER annotation

Analysis

• Whole-genome alignments (and the 12 Drosophilid analysis pipelines from the Kellis group - Rob Waterhouse)

• Data presentation (Trinity clusters, correlation with legacy Hittinger clusters, Velvet-assembled 37 bp reads)

• Variation (SNP calls) from each of the 16 species

Other genomes

• New version of the An. darlingi genome (Osvaldo Marinotti, recently published in NAR)

• New version of the Indian strain of An. stephensi (Jake Tu)

Acknowledgements

EMBL-EBI: Daniel Lawson, Gareth Maslen, Mikkel Christensen, Nick Langridge, Derek Wilson, Gautier Koscielny, Karyn Megy, Martin Hammond, Daniel Hughes, Ewan Birney, Paul Kersey

Imperial College: Fotis Kafatos, Bob MacCallum, George Christophides, Seth Redmond, Timo Tiirikka

Notre Dame: Frank Collins, Greg Madey, Rob Bruggner, Nate Konopinski, EO Stinson, Scott Emrich, Andrew Sheehan, Rory Carmichael, Dave Cieslak, Dave Campbell, Ryan Butler, Katie Cybulski, Neil Lobo, Gloria Calderon, Greg Davis

Harvard: Bill Gelbart, Susan Russo, Dave Emmert, Pinglei Zhou, Lynn Crosby, Kathy Campbell

IMBB: Kitsos Louis, Pantelis Topalis, Emmanuel Dialynas, Vicky Dritsou

New Mexico: Maggie Werner-Washburne, Phil Baker

Sequencers: TIGR/JCVI, WashU, Broad Institute, Baylor College

Ensembl Genomes: Michael Nuhn

Dan Neafsey, Brian Haas, Nora Besansky, Michael Fontaine

Rob Waterhouse, Paul Howell

Anopheles Genomes Cluster Consortium

Steering committee

Community liaisons

