Introduction to Apollo: i5K E affinis

Introduction to Apollo: Collaborative genome annotation editing. A webinar for the i5K Research Community | Dr. Carol Lee’s Lab.

Monica Munoz-Torres, PhD | @monimunozto

Berkeley Bioinformatics Open-Source Projects (BBOP) | Lawrence Berkeley National Laboratory | University of California Berkeley | U.S. Department of Energy

Dr. Carol Lee’s Lab | 23 October, 2015

OUTLINE

Web Apollo: Collaborative Curation and Interactive Analysis of Genomes

•  Today we will discover how to extract very valuable information about a genome through curation efforts.

APOLLO DEVELOPMENT

APOLLO DEVELOPERS

http://GenomeArchitect.org/

Nathan Dunn

Eric Yao JBrowse, UC Berkeley

Christine Elsik’s Lab, University of Missouri

Suzi Lewis Principal Investigator

BBOP

Moni Munoz-Torres

Stephen Ficklin, GenSAS, Washington State University

Colin Diesh, Deepak Unni

AFTER THIS TALK WE WILL...

•  Better understand genome curation in the context of annotation: assembled genome → automated annotation → manual annotation

•  Become familiar with the environment and functionality of the Apollo genome annotation editing tool.

•  Learn to identify homologs of known genes of interest in a newly sequenced genome.

•  Learn about corroborating and modifying automatically annotated gene models using available evidence in Apollo.

What to expect

A typical genome sequencing project


Genome Sequencing Project

Anatomy of a genome sequencing project

Experimental design, sampling.

Comparative analyses

Consensus Gene Set

Manual Annotation

Automated Annotation

Sequencing Assembly

Synthesis & dissemination.

CURATING GENOMES: steps involved

1  Generation of gene models: calling ORFs, one or more rounds of gene prediction, etc.

2  Annotation of gene models: describing function, expression patterns, metabolic network memberships.

3  Manual annotation

GENOME ANNOTATION: objectives and uses

The gene set of an organism informs a variety of studies:
•  Gene number, GC%, TE composition, repetitive regions.
•  Functional assignments.
•  Molecular evolution, sequence conservation.
•  Gene families.
•  Metabolic pathways.
•  What makes an organism what it is?

What makes a bee a “bee”?

Marbach et al. 2011. Nature Methods | Shutterstock.com | Alexander Wild

Bio-refresher

To remember: biological concepts to better understand manual annotation

BIO-REFRESHER

•  GLOSSARY: from contig to splice site
•  CENTRAL DOGMA: in molecular biology
•  WHAT IS A GENE?: defining your goal
•  TRANSCRIPTION: mRNA in detail
•  TRANSLATION: and other definitions
•  GENOME CURATION: steps involved

CENTRAL “DOGMA” of molecular biology

•  DNA can be copied to DNA (DNA replication),

•  DNA information can be copied into mRNA (transcription), and

•  Proteins can be synthesized using the information in mRNA as a template (translation).

http://www.wisegeek.com/

WHAT IS A GENE?

•  A continuously evolving concept paints a very complex picture of molecular activity:

“A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”.

- The Sequence Ontology

WHAT IS A GENE?

•  ... also long transcripts, dispersed regulation.

“The gene is a DNA segment that contributes to phenotype and function. In the absence of demonstrated function, a gene may be characterized by sequence, transcription or homology.”

- The ENCODE Project

https://www.encodeproject.org/

THE GENE: a moving target

“The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.”

Gerstein et al., 2007. Genome Res

TRANSLATION: reading frames

•  A reading frame is a way of dividing the sequence of nucleotides in mRNA (or DNA) into a set of consecutive, non-overlapping triplets (codons).

•  Three frames can be read in the 5’ → 3’ direction. Because DNA has two anti-parallel strands, three more frames can be read on the anti-sense strand, for a total of six possible reading frames.

•  In eukaryotes, only one reading frame per section of DNA is biologically relevant at a time: it has the potential to be transcribed into RNA and translated into protein. This is called the OPEN READING FRAME (ORF).
    •  ORF = Start signal + coding sequence (divisible by 3) + Stop signal

•  The sections of the mature mRNA that are transcribed along with the coding sequence but not translated are called UnTranslated Regions (UTRs); there is one at each end.
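The six reading frames described above can be enumerated in a few lines. This is a minimal sketch for illustration only: the helper names (`reverse_complement`, `codons`, `six_frames`) and the toy sequence are assumptions, not part of Apollo.

```python
# Hypothetical helper names for illustration; this is not Apollo code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the anti-sense strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into consecutive, non-overlapping triplets, starting at offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Return the six reading frames of a DNA sequence as lists of codons."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = codons(strand_seq, offset)
    return frames


frames = six_frames("ATGGCCTAA")  # toy sequence: Start + one codon + Stop
for name, triplets in frames.items():
    print(name, " ".join(triplets))
```

Only frame `+1` of the toy sequence reads as Start, coding sequence, Stop; the other five frames split the same bases into different triplets.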

"Reading Frame" by Hornung Ákos - Wikimedia Commons

TRANSLATION: reading frame

TRANSLATION: splice sites

•  The spliceosome catalyzes the removal of introns and the ligation of flanking exons.
    •  introns: spaces inside the gene, not part of the coding sequence
    •  exons: expression units (of the coding sequence)

•  Splicing signals (from the point of view of an intron):
    •  One splice signal (site) on the 5’ end: usually GT (less common: GC)
    •  And a 3’ end splice site: usually AG
    •  Canonical splice sites look like this: …]5’-GT/AG-3’[…

•  It is possible to produce more than one protein (polypeptide) sequence from the same genic region by bringing exons together in alternative combinations = alternative splicing. For example, the gene Dscam (Drosophila) has 38,000 alternatively spliced mRNAs = isoforms.
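A quick way to see the GT/AG rule in action is to look at the first and last two bases of an intron. The sketch below assumes a forward-strand gene and 0-based, end-exclusive intron coordinates; the function name and the toy sequence are hypothetical.

```python
def classify_intron(genome, intron_start, intron_end):
    """Classify an intron by its donor (5' end) and acceptor (3' end) dinucleotides.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch),
    and the gene is assumed to lie on the forward strand.
    """
    intron = genome[intron_start:intron_end].upper()
    donor, acceptor = intron[:2], intron[-2:]
    if donor == "GT" and acceptor == "AG":
        return "canonical (GT/AG)"
    if donor == "GC" and acceptor == "AG":
        return "less common donor (GC/AG)"
    return f"non-canonical ({donor}/{acceptor})"


# Toy sequence: exon, intron, exon.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "GGATAA"
genome = exon1 + intron + exon2
print(classify_intron(genome, len(exon1), len(exon1) + len(intron)))
```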


TRANSLATION: phase

Introns can interrupt the reading frame of a gene by inserting a sequence:

•  between two consecutive codons (phase 0),

•  between the first and second nucleotide of a codon (phase 1),

•  or between the second and third nucleotide of a codon (phase 2).
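The position at which an intron interrupts a codon (its phase) follows from simple arithmetic: count the coding bases upstream of the intron and take the remainder after dividing by 3. A minimal sketch, assuming we are given the lengths of the coding portions of each exon in transcript order (the function name and the example lengths are made up for illustration):

```python
def intron_phases(cds_exon_lengths):
    """Return the phase of each intron, given coding lengths of consecutive exons.

    Phase 0: the intron falls between two codons.
    Phase 1: after the first nucleotide of a codon.
    Phase 2: after the second nucleotide of a codon.
    """
    phases = []
    coding_bases = 0
    for length in cds_exon_lengths[:-1]:  # no intron follows the last exon
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


print(intron_phases([6, 4, 4, 4]))  # four coding exons, three introns
```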

"Exon and Intron classes”. Licensed under Fair use via Wikipedia

"Gene structure" by Daycd - Wikimedia Commons

mRNA: now in your mind

•  Although mRNAs exist only briefly, understanding them is crucial, as they will become the center of your work.

"Protein synthesis" by Kelvinsong - Wikimedia Commons

TRANSLATION: in detail

Prediction & Annotation

PREDICTION & ANNOTATION

•  Identification and annotation of genome features:

    •  primarily focuses on protein-coding genes.
    •  also identifies RNAs (tRNA, rRNA, long and small non-coding RNAs (ncRNA)), regulatory motifs, repetitive elements, etc.
    •  happens in 2 phases:
        1.  Computation phase
        2.  Annotation phase


COMPUTATION PHASE

a.   Experimental data are aligned to the genome: expressed sequence tags, RNA-sequencing reads, proteins (homologous and heterologous).

b.   Gene predictions are generated:
    - ab initio: based on nucleotide sequence and composition, e.g. Augustus, GENSCAN, geneid, fgenesh, etc.
    - evidence-driven: also identifying domains and motifs, e.g. SGP2, JAMg, fgenesh++, etc.

Result: the single most likely coding sequence, no UTRs, no isoforms.

Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174


ANNOTATION PHASE

Experimental data (evidence) and predictions are synthesized into gene annotations.

Result: gene models that [generally] include UTRs (5’ UTR, 3’ UTR), isoforms, evidence trails.

Yandell & Ence. Nature Rev 2012 doi:10.1038/nrg3174

CONSENSUS GENE SETS

Gene models may be organized into sets using:
•  combiners for automatic integration of predicted sets, e.g. GLEAN, EvidenceModeler
or
•  tools packaged into pipelines, e.g. MAKER, PASA, Gnomon, Ensembl, etc.

In some cases, the algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation.

ANNOTATION: an imperfect art

No one is perfect, least of all automated annotation.

New technology brings new challenges:
•  Assembly errors can cause fragmented annotations
•  Limited coverage makes precise identification a difficult task

Image: www.BroadInstitute.org

MANUAL ANNOTATION: improving predictions

Precise elucidation of biological features encoded in the genome requires careful examination and review.

Schiex et al. Nucleic Acids 2003 (31) 13: 3738-3741

Automated Predictions

Experimental Evidence

Manual Annotation: to the rescue.

cDNAs, HMM domain searches, RNAseq, genes from other species.

BIOCURATION: structural and functional adjustments

1.  Identifies elements that best represent the underlying biology and eliminates elements that reflect systemic errors of automated analyses.

2.  Assigns function through comparative analysis of similar genome elements from closely related species using literature, databases, and experimental data.

http://GeneOntology.org

GENOME ANNOTATION: an inherently collaborative task

Researchers often turn to colleagues for second opinions and insight from those with expertise in particular areas (e.g., domains, families).

So many sequences, not enough hands.

APOLLO: collaborative genome annotation editing tool

•  Web based, integrated with JBrowse.
•  Supports real time collaboration.
•  Automatic generation of ready-made computable data.
•  Supports annotation of genes, pseudogenes, tRNAs, snRNAs, snoRNAs, ncRNAs, miRNAs, TEs, and repeats.
•  Intuitive annotation, gestures, and pull-down menus to create and edit transcripts and exon structures, insert comments (CV, freeform text), associate GO terms, etc.

http://GenomeArchitect.org

LESSONS LEARNED

Continuous training and support for hundreds of geographically dispersed scientists, from diverse research communities, in conducting manual annotation efforts to recover coding sequences in agreement with all available biological evidence using Apollo.

•  Collaborative work distills invaluable knowledge

A LITTLE TRAINING GOES A LONG WAY!

Provided with adequate tools, wet lab scientists make exceptional curators who can easily learn to maximize the generation of accurate, biologically supported gene models.

Apollo

Apollo - current version at i5K Workspace@NAL

The Sequence Selection Window

APOLLO: annotation editing environment

Color by CDS frame, toggle strands, set color scheme and highlights.

- Upload evidence files (GFF3, BAM, BigWig)
- Combination track
- Sequence search track

Query the genome using BLAT.

Navigation and zoom.

Search for a gene model or a scaffold.

Get coordinates and “rubber band” selection for zooming.

Login

User-created annotations.

New annotator panel.

Evidence Tracks

Stage and cell-type specific transcription data.

http://genomearchitect.org/web_apollo_user_guide

Navigation tools: pan and zoom.

Search box: go to a scaffold or a gene model.

Grey bar of coordinates indicates location. You can also select here in order to zoom to a sub-region.

‘View’: change color by CDS, toggle strands, set highlight.

‘File’: Upload your own evidence: GFF3, BAM, BigWig, VCF*. Add combination and sequence search tracks.

‘Tools’: Use BLAT to query the genome with a protein or DNA sequence.

Available Tracks

Evidence Tracks Area

‘User-created Annotations’ Track

Login

APOLLO: graphical user interface (GUI) for editing annotations

Curating with Apollo

GENERAL PROCESS OF CURATION: main steps to remember

1.  Select or find a region of interest, e.g. scaffold.

2.  Select appropriate evidence tracks to review the gene model.

3.  Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working.

4.  If necessary, adjust the gene model.

5.  Check your edited gene model for integrity and accuracy by comparing it with available homologs.

6.  Comment and finish.

USER NAVIGATION: removable side dock

Side dock tabs: Annotations, Organism, Users, Groups, Admin, Tracks, Reference Sequence

EDITS & EXPORTS: annotation details, exon boundaries, data export

[Figure: (1, 2) the Annotations panel with gene and mRNA details; (3) the Reference Sequences panel with FASTA and GFF3 export]

USER NAVIGATION

Annotator panel.

•  Choose appropriate evidence from the list of “Tracks” on the annotator panel.

•  Select & drag elements from an evidence track into the ‘User-created Annotations’ area.

•  Hovering over an annotation in progress brings up an information pop-up.

•  Creating a new annotation

•  Annotation right-click menu

•  ‘Zoom to base level’ option reveals the DNA Track.

•  Color exons by CDS from the ‘View’ menu.

•  Zoom in/out with keyboard: Shift + up/down arrow keys.

•  Toggle reference DNA sequence and translation frames in forward strand. Toggle models in either direction.

Annotation: simple cases

ANNOTATING SIMPLE CASES

“Simple case”:
- the predicted gene model is correct or nearly correct, and
- this model is supported by evidence that completely or mostly agrees with the prediction.
- evidence that extends beyond the predicted model is assumed to be non-coding sequence.

The following are simple modifications.

ADDING EXONS

•  A confirmation box will warn you if the receiving transcript is not on the same strand as the feature where the new exon originated.

•  Check ‘Start’ and ‘Stop’ signals after each edit.


ADDING UTRs

If transcript alignment data are available and extend beyond your original annotation, you may extend or add UTRs.

1.  Right click at the exon edge and ‘Zoom to base level’.

2.  Place the cursor over the edge of the exon until it becomes a black arrow, then click and drag the edge of the exon to the new coordinate position that includes the UTR.

To add a new spliced UTR to an existing annotation, also follow the procedure for adding an exon.


MATCHING EXON BOUNDARY TO EVIDENCE

To modify an exon boundary and match data in the evidence tracks: select both the [offending] exon and the feature with the expected boundary, then right click on the annotation to select ‘Set 3’ end’ or ‘Set 5’ end’ as appropriate.

In some cases all the data may disagree with the annotation; in other cases some data support the annotation and some of the data support one or more alternative transcripts. Try to annotate as many alternative transcripts as are well supported by the data.


ORFs AND SPLICE SITES

Apollo’s editing logic (brain):
•  selects longest ORF as CDS
•  flags non-canonical splice sites

Non-canonical splice site flags.

Double click: selection of feature and sub-features.

Evidence Tracks Area

‘User-created Annotations’ Track

Edge-matching


SPLICE SITES

Non-canonical splices are indicated by an orange circle with a white exclamation point inside, placed over the edge of the offending exon.

Canonical splice sites:

forward strand: 5’-…exon]GT / AG[exon…-3’

reverse strand, not reverse-complemented: 3’-…exon]GA / TG[exon…-5’

Zoom to review non-canonical splice site warnings. Although these may not always have to be corrected (e.g. GC donor), they should be flagged with a comment.

Exon/intron splice site error warning

Curated model
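The reverse-strand notation above can be checked with a short sketch. It assumes 0-based, end-exclusive coordinates given on the forward strand; the function name and the eight-base toy intron are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def intron_ends(forward_seq, start, end, strand):
    """Return (donor, acceptor) dinucleotides read 5' to 3' on the gene's own strand.

    start/end are 0-based, end-exclusive coordinates on the forward strand
    (an assumption of this sketch).
    """
    intron = forward_seq[start:end].upper()
    if strand == "-":
        # Reverse-complement so the intron is read in the gene's direction.
        intron = intron.translate(COMPLEMENT)[::-1]
    return intron[:2], intron[-2:]


forward = "CTGAAAAC"  # forward-strand bases under a reverse-strand intron

# Complement only (not reversed): what the reverse strand looks like when
# drawn underneath the forward strand. It starts with GA and ends with TG.
print(forward.translate(COMPLEMENT))

# Read 5' to 3' on the reverse strand, the same intron is GT ... AG.
print(intron_ends(forward, 0, len(forward), "-"))
```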


‘Start’ AND ‘Stop’ SITES

Apollo calculates the longest possible open reading frame (ORF) that includes canonical ‘Start’ and ‘Stop’ signals within the predicted exons.

If ‘Start’ appears to be incorrect, modify it by selecting an in-frame ‘Start’ codon further upstream or downstream, depending on evidence (proteins, RNAseq).

The correct ‘Start’ may be present outside the predicted gene model, within a region supported by another evidence track.

In very rare cases, the actual ‘Start’ codon may be non-canonical (non-ATG).
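To make the "longest ORF" idea concrete, here is a minimal sketch of a longest-ORF search over a spliced transcript sequence. It is not Apollo's implementation, which the slides do not show; the function name, the restriction to ATG starts and the three standard stop codons, and the toy sequence are all assumptions of this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(mrna):
    """Return (start, end) of the longest ATG...Stop ORF, or None if there is none.

    Coordinates are 0-based and end-exclusive on the given transcript sequence.
    Only the strand as given is searched.
    """
    mrna = mrna.upper()
    best = None
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon, in frame with this ATG, until the first Stop.
        for pos in range(start + 3, len(mrna) - 2, 3):
            if mrna[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                break
    return best


transcript = "TTATGCCCATGAAATAGCC"  # two in-frame ATGs sharing one Stop
start, end = longest_orf(transcript)
print(start, end, transcript[start:end])
```

In the toy transcript both ATGs are in frame with the same Stop, so the search picks the upstream one. A curator with protein or RNAseq evidence for the shorter product would instead set the downstream ‘Start’ by hand.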


CHECKING EXON INTEGRITY

1.  Zoom in to clearly resolve each exon as a distinct rectangle.

2.  Two exons from different tracks sharing the same start/end coordinates display a red bar to indicate matching edges.

3.  Selecting the whole annotation or one exon at a time, use this edge-matching function and scroll along the length of the annotation, verifying exon boundaries against available data. Use square [ ] brackets to scroll from exon to exon. Use curly { } brackets to scroll from annotation to annotation.

4.  Check if cDNA / RNAseq reads lack one or more of the annotated exons or include additional exons.


complex cases

COMPLEX CASES: merge two gene predictions on the same scaffold

Evidence may support joining two or more different gene models. Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!

1.  In the ‘User-created Annotations’ area, shift-click to select an intron from each gene model, then right click and select the ‘Merge’ option from the menu.

2.  Drag supporting evidence tracks over the candidate models to corroborate overlap, or review edge matching and coverage across models.

3.  Check the resulting translation by querying a protein database, e.g. UniProt, NCBI nr. Add comments to record that this annotation is the result of a merge.

Red lines around exons: ‘edge-matching’ allows annotators to confirm whether the evidence is in agreement without examining each exon at the base level.

COMPLEX CASES: split a gene prediction

One or more splits may be recommended when:
- different segments of the predicted protein align to two or more different gene families
- the predicted protein doesn’t align to known proteins over its entire length

Transcript data may support a split, but first verify whether they are alternative transcripts.


COMPLEX CASES: correcting frameshifts and single-base errors

DNA Track

‘User-created Annotations’ Track

Always remember: when annotating gene models using Apollo, you are looking at a ‘frozen’ version of the genome assembly and you will not be able to modify the assembly itself.

COMPLEX CASES: correcting selenocysteine-containing proteins

COMPLEX CASES: correcting frameshifts, single-base errors, and selenocysteines

1.  Apollo allows annotators to make single-base modifications or frameshifts that are reflected in the sequence and structure of any transcripts overlapping the modification. These manipulations do NOT change the underlying genomic sequence.

2.  If you determine that you need to make one of these changes, zoom in to the nucleotide level and right click over a single nucleotide on the genomic sequence to access a menu that provides options for creating insertions, deletions or substitutions.

3.  The ‘Create Genomic Insertion’ feature will require you to enter the necessary string of nucleotide residues that will be inserted to the right of the cursor’s current location. The ‘Create Genomic Deletion’ option will require you to enter the length of the deletion, starting with the nucleotide where the cursor is positioned. The ‘Create Genomic Substitution’ feature asks for the string of nucleotide residues that will replace the ones on the DNA track.

4.  Once you have entered the modifications, Apollo will recalculate the corrected transcript and protein sequences, which will appear when you use the right-click menu ‘Get Sequence’ option. Since the modification is reflected in all annotations that include the modified region, you should alert the curators of your organism’s database, using the ‘Comments’ section to report the CDS edits.

5.  In special cases such as selenocysteine-containing proteins (read-throughs), right-click over the offending/premature ‘Stop’ signal and choose the ‘Set readthrough stop codon’ option from the menu.
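The three kinds of sequence modification described above can be pictured as operations that return an edited copy of the sequence and leave the reference untouched. The sketch below follows the wording of the slide (insert to the right of the cursor; delete starting at the cursor; substitute starting at the cursor). The function names and 0-based positions are assumptions for illustration and do not reflect how Apollo stores these edits internally.

```python
def apply_insertion(seq, position, residues):
    """Insert residues to the right of the 0-based cursor position."""
    return seq[:position + 1] + residues + seq[position + 1:]


def apply_deletion(seq, position, length):
    """Delete `length` bases, starting with the base at the cursor position."""
    return seq[:position] + seq[position + length:]


def apply_substitution(seq, position, residues):
    """Replace the bases starting at the cursor position with residues."""
    return seq[:position] + residues + seq[position + len(residues):]


reference = "ATGAAATAA"  # toy 'frozen' assembly sequence

print(apply_insertion(reference, 2, "C"))     # one extra base after the Start codon
print(apply_deletion(reference, 3, 1))        # one base removed from the second codon
print(apply_substitution(reference, 3, "G"))  # single-base change in the second codon
print(reference)                              # the reference itself is unchanged
```

An insertion or deletion whose length is not a multiple of 3 shifts the reading frame downstream of the edit, which is why the transcript and protein sequences have to be recalculated afterwards.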


USER NAVIGATION

•  Annotation right-click menu

Annotations, annotation edits, and History: stored in a centralized database.


COMPLETING THE ANNOTATION

Follow the checklist until you are happy with the annotation!

And remember to…
-  comment to validate your annotation, even if you made no changes to an existing model. Think of comments as your vote of confidence.
-  or add a comment to inform the community of unresolved issues you think this model may have.

Always remember: Apollo curation is a community effort, so please use comments to communicate the reasons for your annotation. Your comments will be visible to everyone.


USER NAVIGATION

•  Annotation right-click menu

The Annotation Information Editor

DBXRefs are database cross-references: if you have reason to believe that this gene is linked to a gene in a public database (including your own), then add it here.

•  Add PubMed IDs
•  Include GO terms as appropriate from any of the three ontologies
•  Write comments stating how you have validated each model.


Checklist

CHECKLIST for accuracy and integrity

•  Check ‘Start’ and ‘Stop’ sites.

•  Check splice sites: most splice sites display these residues …]5’-GT/AG-3’[…

•  Check if you can annotate UTRs, for example using RNA-Seq data:
    -  align it against relevant genes/gene family
    -  blastp against NCBI’s RefSeq or nr

•  Check for gaps in the genome.

•  Additional functionality may be necessary:
    -  merging 2 gene predictions on the same scaffold
    -  merging 2 gene predictions from different scaffolds
    -  splitting a gene prediction
    -  correcting frameshifts and other errors in the genome assembly
    -  annotating selenocysteines, correcting single-base errors, etc.

•  Add:
    -  Important project information in the form of comments
    -  IDs from public databases, e.g. GenBank (via DBXRef), gene symbol(s), common name(s), synonyms, top BLAST hits, orthologs with species names, and everything else you can think of, because you are the expert.
    -  Comments about the kinds of changes you made to the gene model of interest, if any.
    -  Any appropriate functional assignments, e.g. via BLAST, RNA-Seq data, literature searches, etc.

Curating within i5K

THE COLLABORATIVE CURATION PROCESS AT i5K

1.  A computationally predicted consensus gene set has been generated using multiple lines of evidence; e.g. LDEC_v0.5.3-Models

2.  i5K Projects will integrate consensus computational predictions with manual annotations to produce an updated Official Gene Set (OGS):

Achtung!
•  If it’s not on either track, it won’t make the OGS!
•  If it’s there and it shouldn’t be, it will still make the OGS!

3.  In some cases, the algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene’s representation. Use your judgment and choose a different model to annotate.

4.  Isoforms: drag the original and the alternatively spliced form to the ‘User-created Annotations’ area.

5.  If an annotation needs to be removed from the consensus set, drag it to the ‘User-created Annotations’ area and label it as ‘Delete’ in the Information Editor.

6.  Overlapping interests? Collaborate to reach agreement.

7.  Follow guidelines for i5K Pilot Species Projects, at http://goo.gl/LRu1VY

Example

Curation example using the Hyalella azteca genome (amphipod crustacean).

What do we know about this genome?

•  Currently publicly available data at NCBI:
    •  >37,000 nucleotide seqs → scaffolds, mitochondrial genes
    •  344 amino acid seqs → mitochondrion
    •  47 ESTs
    •  0 conserved domains identified
    •  0 “gene” entries submitted

•  Data at i5K Workspace@NAL (annotation hosted at USDA)
    - 10,832 scaffolds: 23,288 transcripts: 12,906 proteins

PubMed Search: what’s new?

“Ten populations (3 cultures, 7 from California water bodies) differed by at least 550-fold in sensitivity to pyrethroids.”

“By sequencing the primary pyrethroid target site, the voltage-gated sodium channel (vgsc), we show that point mutations and their spread in natural populations were responsible for differences in pyrethroid sensitivity.”

“The finding that a non-target aquatic species has acquired resistance to pesticides used only on terrestrial pests is troubling evidence of the impact of chronic pesticide transport from land-based applications into aquatic systems.”

How many sequences are there, publicly available, for our gene of interest?

•  Para (voltage-gated sodium channel alpha subunit; Nasonia vitripennis).

•  NaCP60E (Sodium channel protein 60 E; D. melanogaster).
    -  MF: voltage-gated cation channel activity (IDA, GO:0022843).
    -  BP: olfactory behavior (IMP, GO:0042048), sodium ion transmembrane transport (ISS, GO:0035725).
    -  CC: voltage-gated sodium channel complex (IEA, GO:0001518).

And what do we know about them?

Retrieving sequences for a sequence similarity search.

BLAT search: input

>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDGQMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR
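The BLAT input is a FASTA record: a header line starting with ">" followed by the sequence. Before pasting a query into a search form it can help to check that the record parses and contains only amino-acid letters. A minimal sketch using the query from the slide; the parser and its function name are written for this example and are not part of Apollo or BLAT.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # record name is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


query = """>vgsc-Segment3-DomainII
RVFKLAKSWPTLNLLISIMGKTVGALGNLTFVLCIIIFIFAVMGMQLFGKNYTEKVTKFKWSQDGQMPRWNFVDFFHSFMIVFRVLCGEWIESMWDCMYVGDFSCVPFFLATVVIGNLVVSFMHR
"""

records = parse_fasta(query)
protein = records["vgsc-Segment3-DomainII"]

# The 20 standard amino-acid one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
print(len(protein), set(protein) <= AMINO_ACIDS)
```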


BLAT search: results

•  High-scoring segment pairs (hsp) are listed in tabulated format.

•  Clicking on one line of results sends you to those coordinates.

BLAST at i5K: https://i5k.nal.usda.gov/blast

BLAST at i5K: hsps in “BLAST+ Results” track

Creating a new gene model: drag and drop

•  Apollo automatically calculates the longest ORF.

•  In this case, the ORF includes the high-scoring segment pairs (hsp), marked here in blue.

•  Note that the gene is transcribed from the reverse strand.

Available Tracks

Get Sequence

http://blast.ncbi.nlm.nih.gov/Blast.cgi

Also, flanking sequences (other gene models) vs. NCBI nr

In this case, two gene models upstream, at 5’ end.

BLAST hsps

Review alignments

HaztTmpM006234

HaztTmpM006233

HaztTmpM006232

Hypothesis for vgsc gene model

Editing: merge the three models

Merge by dropping an exon or gene model onto another,

or…

merge by selecting two exons (holding down “Shift”) and using the right-click menu.

Result of merging the gene models:

Editing: correct offending splice site

Modify exon / intron boundary:
-  Drag the end of the exon to the nearest canonical splice site.
or
-  Use the right-click menu.

Editing: set translation start

Editing: delete exon not supported by evidence

Delete first exon from HaztTmpM006233

Editing: add an exon supported by RNAseq

•  RNAseq reads show evidence in support of a transcribed product, which was not predicted.
•  Add exon at coordinates 97946-98012 by dragging up one of the RNAseq reads.
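When adding an exon it is worth checking how its length relates to the reading frame. A small worked calculation, assuming the coordinates on the slide are 1-based and inclusive (the GFF3 convention) and that the whole exon is coding; both are assumptions of this sketch, not facts stated in the slides.

```python
def feature_length(start, end):
    """Length of a feature given 1-based, inclusive coordinates (assumed convention)."""
    return end - start + 1


length = feature_length(97946, 98012)
remainder = length % 3
print(length, remainder)
```

Under these assumptions the new exon is not a multiple of 3 bases long, so adding it changes the frame of everything downstream. That is one reason to re-check ‘Start’, ‘Stop’ and the predicted protein after each edit.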

Editing: adjust offending splice site using evidence

Editing: adjust other boundaries supported by evidence

Finished model

Corroborate integrity and accuracy of the model:
-  Start and Stop
-  Exon structure and splice sites …]5’-GT/AG-3’[…
-  Check the predicted protein product vs. NCBI nr, UniProt, etc.

Information Editor

•  DBXRefs: e.g. NP_001128389.1, N. vitripennis, RefSeq

•  PubMed identifier: PMID: 24065824

•  Gene Ontology IDs: GO:0022843, GO:0042048, GO:0035725, GO:0001518.

•  Comments

•  Name, Symbol

•  Approve / Delete radio button

Comments (if applicable)

Go play!

PUBLIC DEMO

APOLLO ON THE WEB: instructions

At i5K:

1.  Register for access to Apollo at the i5K Workspace@NAL at https://i5k.nal.usda.gov/web-apollo-registration

2.  Contact the coordinator for each species community to receive more information about how to contribute. Contact info is available on each organism’s page.

Public Honey bee demo available at: http://GenomeArchitect.org/WebApolloDemo

Username: [email protected]

Password: demo

APOLLO: demonstration

Demonstration video is available at https://youtu.be/VgPtAP_fvxY

OUTLINE

Web Apollo: Collaborative Curation and Interactive Analysis of Genomes

•  BIO-REFRESHER: biological concepts for curation
•  ANNOTATION: automatic predictions
•  MANUAL ANNOTATION: necessary, collaborative
•  APOLLO: advancing collaborative curation
•  EXAMPLE: demos

Thank you!

•  Berkeley Bioinformatics Open-source Projects (BBOP), Berkeley Lab: Apollo and Gene Ontology teams. Suzanna E. Lewis (PI).

•  § Christine G. Elsik (PI). University of Missouri.

•  * Ian Holmes (PI). University of California Berkeley.

•  Arthropod genomics community: i5K Steering Committee (esp. Sue Brown (Kansas State)), Alexie Papanicolaou (UWS), and the Honey Bee Genome Sequencing Consortium.

•  Stephen Ficklin, GenSAS, Washington State University

•  Apollo is supported by NIH grants 5R01GM080203 from NIGMS, and 5R01HG004483 from NHGRI. Both projects are also supported by the Director, Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231

•  For your attention, thank you!

Apollo

Nathan Dunn

Colin Diesh §

Deepak Unni §

Gene Ontology

Chris Mungall

Seth Carbon

Heiko Dietze

BBOP

Apollo: http://GenomeArchitect.org

GO: http://GeneOntology.org

i5K: http://arthropodgenomes.org/wiki/i5K

NAL at USDA: Monica Poelchau, Christopher Childers, Gary Moore, Mei-Ju Chen

HGSC at BCM

fringy Richards

Kim Worley

JBrowse Eric Yao *
