Outline
● Gene annotation
  o Gene automatic annotations
  o Gene manual annotation and metadata
  o Basics: A good vs a bad gene model
  o Why do we need gene manual annotations and gene metadata?
● Why did we replace the Community Manual Annotation (CAP) with Web Apollo (WA)?
  o Offline vs. online
  o Advantages vs disadvantages
● How do we interact with WA developers and outreach representatives?
● How do we get the community to submit data?
VectorBase gene “automatic” annotations
gap
100 Ns
Scaffolds or supercontigs
mapping (Optional. Not possible with bioinformatics, must be experimental)
Gene prediction: evidence-based (BLAST), ab initio (SNAP), experimental evidence (ESTs, RNAseq, protein or peptide sequencing)
Metadata
- VectorBase gene ID (e.g., AGAP000002)
- Organism (species) (e.g., Anopheles gambiae)
- Symbol (e.g., para)
- Synonym (e.g., kdr, VSC)
- Description (e.g., voltage-gated sodium channel)
- Comments/notes (e.g., truncated gene, other part on scaffold xxx)
- Homologs and Phylogenetics
- Ontology
- Variation (e.g., Single Nucleotide Polymorphisms, SNPs)
Why do we need gene manual annotations and gene metadata?
For downstream analyses of gene(s), gene families or genomes such as:
Homologs and Phylogenetics
- wrong assignment of orthologs and paralogs
- gene alignment ---> tree
- wrong inference of evolutionary relationships between genes or species
- branches with a wrong length, which could give a misleading picture of how lineages change over time (the longer the branch, the larger the amount of change)
- wrong estimates of the ancestral and derived states of genes or species
- wrong taxonomic interpretations
Ontology
GO: biological process (ion transport, sodium ion transport, transmembrane transport)
GO: molecular function (ion channel activity, voltage-gated sodium channel activity, calcium ion binding)
GO: cellular component (voltage-gated sodium channel, membrane)
Variation (e.g., Single Nucleotide Polymorphisms, SNPs)
. . . T T A . . .
. . . T T T . . .
SNP
L1014F
Leucine ---> Phenylalanine
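The codon change above can be checked with a few lines of code. This is only a minimal sketch: the codon table holds just the four leucine/phenylalanine codons needed here (from the standard genetic code), and the function name `describe_substitution` is made up for illustration.

```python
# Partial codon table (standard genetic code); only the codons needed here.
CODON_TO_AA = {"TTA": "L", "TTG": "L", "TTT": "F", "TTC": "F"}


def describe_substitution(ref_codon, alt_codon, position):
    """Name the amino acid change caused by a single-codon difference."""
    ref_aa = CODON_TO_AA[ref_codon]
    alt_aa = CODON_TO_AA[alt_codon]
    label = f"{ref_aa}{position}{alt_aa}"
    if ref_aa == alt_aa:
        # Same amino acid on both sides: the SNP does not change the protein.
        return label + " (synonymous)"
    return label


# TTA (Leucine) ---> TTT (Phenylalanine) at residue 1014
print(describe_substitution("TTA", "TTT", 1014))
```

A change of TTA to TTG at the same position would instead be reported as synonymous, since both codons encode leucine.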
Hypothetical example:
- A user is interested in gene “x”
- They download this gene from VB
- They start their analyses
- They find/report the presence/absence of the SNP
- If the gene of interest is not correctly annotated, e.g., missing an exon or part of an exon, the results are going to be wrong
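The effect of a missing exon can be illustrated with toy numbers. The sketch below is an assumption-laden example, not a real gene model: the exon coordinates are invented, and the helper names `protein_length` and `has_residue` are hypothetical.

```python
def protein_length(coding_exons):
    """Count complete codons in a list of (start, end) coding exons (1-based, inclusive)."""
    coding_bases = sum(end - start + 1 for start, end in coding_exons)
    return coding_bases // 3


def has_residue(coding_exons, residue_number):
    """True if the translated gene model is long enough to contain the residue."""
    return protein_length(coding_exons) >= residue_number


# Invented coordinates for illustration only.
correct_model = [(1, 1500), (2001, 3500), (4001, 4600)]
missing_exon_model = [(1, 1500), (4001, 4600)]  # second exon not annotated

print(has_residue(correct_model, 1014))       # residue 1014 exists
print(has_residue(missing_exon_model, 1014))  # residue 1014 cannot be found
```

With the incomplete model, an analysis looking for a change at residue 1014 would either fail or report on the wrong residue, even though the underlying genome sequence is unchanged.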
- The size of the genomes
- The phylogenetic distance among genomes
Number of genomes (genome size):
- VB: 37 (110 Mbp – 3,000 Mbp)
- EuPathDB: 186 (2 Mbp – 193 Mbp)
- PATRIC: 3,481 Bacteria & 186 Archaea (10 kbp – 14 Mbp)
- ViPR: 546,381 & IRD: 365,618 (few kbp – 250 kbp)
Offline vs. online curation
Community Manual Annotation (CAP)
Web Apollo
gene models
RNAseq
User-created Annotations
Advantages & Disadvantages
Community Manual Annotation (CAP)
- People had to use Artemis or (Desktop) Apollo: requires downloading scaffolds or supercontigs from VB
- VB gene updates can take 2 months or more, so more than one person could end up working on the same gene
- Most of the time our internal GFF3 validator found issues with submitted data files.
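To give a flavor of the kind of problems a GFF3 check can catch, here is a minimal sketch. It is not the internal VB validator; the function name `check_gff3_line` and the selection of checks are assumptions, covering only a few rules from the GFF3 format (nine tab-separated columns, start <= end, valid strand, phase required for CDS).

```python
VALID_STRANDS = {"+", "-", ".", "?"}
VALID_PHASES = {"0", "1", "2"}


def check_gff3_line(line):
    """Return a list of problems found in one GFF3 feature line (empty if none)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    problems = []
    if not (start.isdecimal() and end.isdecimal()):
        problems.append("start and end must be positive integers")
    elif int(start) < 1 or int(start) > int(end):
        problems.append("start must be >= 1 and <= end")
    if strand not in VALID_STRANDS:
        problems.append(f"invalid strand {strand!r}")
    if ftype == "CDS" and phase not in VALID_PHASES:
        problems.append("CDS features need phase 0, 1 or 2")
    return problems


# Hypothetical feature lines for illustration.
good = "scaffold_1\texample\tCDS\t100\t200\t.\t+\t0\tID=cds1"
bad = "scaffold_1\texample\tCDS\t300\t200\t.\tx\t.\tID=cds1"
print(check_gff3_line(good))
print(check_gff3_line(bad))
```

A full validator would also check parent/child relationships between features and attribute syntax, which this sketch leaves out.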
Web Apollo
- Is web-based, which allows easier collaboration
- There is not, however, a clear way to indicate/know when a user is “still working” or “done” with an annotation.
- New annotations, though, are instantly visible to all users of WA.
How do we interact with Web Apollo developers and outreach representatives?
- Developers:
  ○ Monthly WA developers open conference call
  ○ email
- Outreach:
  ○ Meetings, workshops and conferences
  ○ email or phone
We are also subscribed to their user email list (help desk).
How do we get the community to submit data?
- First invitation comes from genome leaders directly (genome paper)
- Users send emails to our help desk ([email protected])
- During outreach events, such as workshops, meetings and conferences
- Social media posts (Facebook and Twitter)
- Help content: Tutorial page
Genome group manual annotation efforts
- Workshops
- Annotation jamborees
- Webinars
- Independent work
Help content: Tutorial page
- Decision tree
- FAQs
- Web Apollo resources: user guide, slides with speaker notes, sample exercises
- Documentation about available tracks
- Video tutorial (Intro, ~ 50 min) and a video clip (Intron/exon boundaries ~2:45 min)