Making best use of TAIR tools and datasets

Date post: 18-Jan-2016
Page 1: Making best use of TAIR tools and datasets

Making best use of TAIR tools and datasets

Philippe Lamesch
Donghui Li

The Arabidopsis Information Resource
www.arabidopsis.org

contact us: [email protected]

Page 2: Making best use of TAIR tools and datasets

TAIR: The Arabidopsis Information Resource

• collect, curate and distribute information on Arabidopsis
• information freely available from arabidopsis.org

Page 3: Making best use of TAIR tools and datasets

Outline

• Gene structure – Philippe Lamesch

• Gene function – Donghui Li

• Metabolic pathway – Donghui Li

• New tools – Philippe Lamesch

Page 4: Making best use of TAIR tools and datasets

Slides available from TAIR www.arabidopsis.org

Page 5: Making best use of TAIR tools and datasets

TAIR is used worldwide
Visits per month (source: Google Analytics)

Page 6: Making best use of TAIR tools and datasets

TAIR usage in Asia: June 2009-June 2010

Page 7: Making best use of TAIR tools and datasets

What we do: (1) Arabidopsis genome annotation

Page 8: Making best use of TAIR tools and datasets

What we do: (2) manual literature curation

• Controlled vocabulary annotations

Gene Ontology (GO) http://www.geneontology.org/

Plant Ontology (PO) http://www.plantontology.org/

• Gene name, symbol

• Allele, phenotype

• Summary statement composition

Page 9: Making best use of TAIR tools and datasets

What we do: (3) metabolic pathway curation

AraCyc

A metabolic pathway database for Arabidopsis thaliana that contains information about both predicted and experimentally determined pathways, reactions, compounds, genes and enzymes.

PlantCyc and PMN (Plant Metabolic Network)

Page 10: Making best use of TAIR tools and datasets

What we do: (4) work with ABRC to distribute research material

Page 11: Making best use of TAIR tools and datasets

Part I: The Arabidopsis genome annotation

• A new approach for improving the Arabidopsis genome annotation
• Where to find gene structure related data at TAIR
• The Arabidopsis gene structure confidence ranking

Page 12: Making best use of TAIR tools and datasets

Arabidopsis genome annotation

• Arabidopsis genome sequenced almost 10 years ago
• High quality sequence with few gaps
• TIGR did initial genome annotation
• TAIR took over responsibility in 2005
• Current TAIR9 stats:
  o 27,379 protein coding genes
  o 4,827 pseudogenes or transposable elements
  o 1,312 ncRNAs

Page 13: Making best use of TAIR tools and datasets

Genome annotation at TAIR

• Add novel genes
• Update exon/intron structures of existing genes
• Delete mispredicted genes
• Merge and split genes
• Change gene types
• Add splice-variants

Page 14: Making best use of TAIR tools and datasets

Genome annotation at TAIR

Annotate ‘atypical’ gene classes

• Short protein-coding genes
• Transposable element genes
• Pseudogenes
• uORFs (genes within UTR of other genes)

• Add novel genes
• Update exon/intron structures of existing genes
• Delete mispredicted genes
• Merge and split genes
• Change gene types
• Add splice-variants

Page 15: Making best use of TAIR tools and datasets

Arabidopsis gene structure annotation A new approach

TAIR6-TAIR9: Use ESTs and cDNAs and an assembly tool called PASA to improve gene structures

TAIR10

TAIR10: Use new experimental data and new prediction tools to further improve gene structure predictions

Page 16: Making best use of TAIR tools and datasets

Using PASA and ESTs/cDNAs

Clustered transcripts

NCBI

Genome annotation TAIR6-TAIR9

Page 17: Making best use of TAIR tools and datasets

Clustered transcripts

Resulting gene model

NCBI

Using PASA and ESTs/cDNAs

Genome annotation TAIR6-TAIR9

Page 18: Making best use of TAIR tools and datasets

Clustered transcripts

Resulting gene model

Previous gene model

NCBI

comparison

• Novel genes
• New splice-variants
• Gene structure updates

Using PASA and ESTs/cDNAs

Genome annotation TAIR6-TAIR9
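
The comparison between the resulting gene model and the previous gene model can be illustrated with a deliberately simplified sketch. It is not PASA's actual rule set, which these slides do not spell out: the function names, the representation of exons as (start, end) tuples, and the decision rules below are all assumptions made for illustration.

```python
def intron_chain(exons):
    """Return the introns implied by a list of (start, end) exons (1-based, inclusive)."""
    ordered = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def classify_model(new_exons, previous_models):
    """Toy comparison of a transcript-derived model with the previous models at a locus.

    Assumed rules (hypothetical, for illustration only):
      - no previous model at the locus       -> "novel gene"
      - identical exons to a previous model  -> "unchanged"
      - same introns, different exon ends    -> "gene structure update"
      - different introns                    -> "new splice-variant"
    """
    if not previous_models:
        return "novel gene"
    new_sorted = sorted(new_exons)
    if any(new_sorted == sorted(prev) for prev in previous_models):
        return "unchanged"
    new_introns = intron_chain(new_exons)
    if any(new_introns == intron_chain(prev) for prev in previous_models):
        return "gene structure update"
    return "new splice-variant"


if __name__ == "__main__":
    previous = [[(100, 200), (300, 400)]]
    # Same intron, longer terminal exons (for example extended UTRs).
    print(classify_model([(50, 200), (300, 450)], previous))
    # Different acceptor site for the second exon.
    print(classify_model([(100, 200), (250, 400)], previous))
```

A real pipeline would also weigh the quality of the supporting transcripts and check that the coding sequence stays intact, which this sketch ignores.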

Page 19: Making best use of TAIR tools and datasets

Manual annotation at TAIR: Apollo

Labels in the Apollo screenshot:
• ESTs
• cDNAs
• Radish sequence alignments
• Eugene prediction
• Dicot sequence alignments
• Monocot sequence alignments
• Aceview gene predictions
• Short MS peptide
• 2 gene isoforms

Page 20: Making best use of TAIR tools and datasets

TAIR10: using proteomics and RNA-seq data to improve genome annotation

4-step process:
1. Mapping RNA-seq & peptides
2. Assembly/gene build
3. Manual review
4. Integration (genome release/GBrowse)

Page 21: Making best use of TAIR tools and datasets

Mapping and Assembly

1. Mapping
• RNA-seq sequences (TopHat (C. Trapnell), Supersplat (T.C. Mockler))
• Peptides (6-frame translation, spliced exon graph)

2. Assembly approaches
• Augustus (M. Stanke)
  o Uses spliced RNA-seq reads, peptides
  o Aim: Identify additional splice-variants, update existing genes
• TAU (T.C. Mockler)
  o Uses spliced RNA-seq reads
  o Aim: Identify additional splice-variants
• Cufflinks (C. Trapnell)
  o Uses spliced and unspliced RNA-seq data
  o Aim: Identify novel genes
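
The peptide-mapping step listed above mentions 6-frame translation. A minimal sketch in Python, assuming the standard genetic code and an unspliced search only (the spliced exon graph, needed for peptides that span an intron, is not covered here). The function names are illustrative and are not taken from any TAIR pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return a dict mapping frame names ('+1'..'+3', '-1'..'-3') to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def frames_containing(seq, peptide):
    """List the frames whose translation contains the peptide as a substring."""
    return [name for name, protein in six_frame_translations(seq).items()
            if peptide in protein]


if __name__ == "__main__":
    print(six_frame_translations("ATGGCCAAATAG"))
    print(frames_containing("ATGGCCAAATAG", "MAK"))
```

A peptide that matches in a frame or region where no gene model is annotated is the kind of evidence that can point to a missing gene or an incorrect start codon.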

Page 22: Making best use of TAIR tools and datasets

Augustus

TopHat, SuperSplat

145,000 RNA-seq junctions based on >1 read

203,000 clustered spliced RNA-seq junctions

(spliced RNA-seq junction)

RNA-seq datasets (Mockler Lab, Ecker Lab)

200 Million aligned RNA-seq reads
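
The figures on this slide describe spliced reads being clustered into junctions, followed by a subset of junctions based on >1 read. A minimal sketch of that cluster-then-filter idea, assuming each spliced alignment has already been reduced to a (chromosome, intron start, intron end, strand) tuple; the function names and the toy coordinates are hypothetical.

```python
from collections import Counter


def count_junction_support(spliced_alignments):
    """Cluster spliced read alignments by the intron they span.

    Each alignment is a (chrom, intron_start, intron_end, strand) tuple;
    identical tuples are treated as reads supporting the same junction.
    """
    return Counter(spliced_alignments)


def filter_junctions(junction_counts, min_reads=2):
    """Keep junctions supported by at least `min_reads` reads (more than 1 by default)."""
    return {junction: n for junction, n in junction_counts.items() if n >= min_reads}


if __name__ == "__main__":
    # Toy alignments, not real data.
    reads = [
        ("Chr1", 1201, 1299, "+"),
        ("Chr1", 1201, 1299, "+"),
        ("Chr1", 1201, 1350, "+"),
        ("Chr2", 500, 620, "-"),
        ("Chr2", 500, 620, "-"),
        ("Chr2", 500, 620, "-"),
    ]
    counts = count_junction_support(reads)
    print(len(counts), "clustered junctions")
    print(filter_junctions(counts))
```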

Page 23: Making best use of TAIR tools and datasets

Augustus

145,000 RNA-seq junctions based on >1 read
260,000 peptides (Baerenfaller et al, Castellana et al)

Augustus gene prediction

+ ESTs & cDNAs
+ AGI models

11% of RNA-seq junctions incorporated into Augustus models
64% of peptide sequences incorporated into Augustus models

Predicted Augustus models:
5,461 distinct models
1,596 novel models

Page 24: Making best use of TAIR tools and datasets

Categorisation/Review

TAU Models

RNA-seq Junctions

Augustus Model

TAIR confidence rank

TAIR Model

Peptides

(Splice variants, NMD targets)

(correction)

(colour reflects matching model)

Incorrect junction in TAIR model

Unsupported exon

Page 25: Making best use of TAIR tools and datasets

Example Augustus update

Page 26: Making best use of TAIR tools and datasets

Example 2 Augustus update

Page 27: Making best use of TAIR tools and datasets

Example Augustus splice variant

Page 28: Making best use of TAIR tools and datasets

Example 2 Augustus splice variant

Page 29: Making best use of TAIR tools and datasets

Augustus/TAU/Cufflinks

Augustus
• Incorporate 64% of peptides not contained in TAIR, 11% for RNA-seq junctions
• 5,461 potential updated genes
• 1,596 potential novel genes

TAU
• 30,083 junctions distinct to Augustus or TAIR models
• 10,902 junctions incorporated into 10,491 TAU models

Cufflinks
• 367 novel assemblies which fall above the 100 bp & >15 FPKM filter

# TE-filter applied to Augustus and Cufflinks models
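
The Cufflinks filter quoted above (100 bp and >15 FPKM) can be sketched as a simple threshold test. This is an illustration only: the `Assembly` record and its field names are hypothetical, and reading "above the 100 bp" as "at least 100 bp" is an assumption. The TE-filter mentioned in the footnote is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assembly:
    """Hypothetical record for one assembled transcript."""
    name: str
    length_bp: int
    fpkm: float


def passes_filter(assembly, min_length_bp=100, min_fpkm=15.0):
    """Assumed reading of the slide: length of at least 100 bp and FPKM strictly above 15."""
    return assembly.length_bp >= min_length_bp and assembly.fpkm > min_fpkm


def filter_assemblies(assemblies, min_length_bp=100, min_fpkm=15.0):
    """Return the assemblies that pass both thresholds, in input order."""
    return [a for a in assemblies if passes_filter(a, min_length_bp, min_fpkm)]


if __name__ == "__main__":
    # Toy assemblies, not real data.
    candidates = [
        Assembly("asm_1", 450, 32.0),
        Assembly("asm_2", 80, 120.0),   # too short
        Assembly("asm_3", 900, 15.0),   # FPKM not above the threshold
    ]
    print([a.name for a in filter_assemblies(candidates)])
```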

Page 30: Making best use of TAIR tools and datasets

Preliminary Results

Augustus/TAU/Cufflinks predicted models are classified into categories:

• Novel genes
• Updated genes
• Splice-variants
• B-list
• Rejects

Page 31: Making best use of TAIR tools and datasets

Preliminary Results

Augustus/TAU/Cufflinks predicted models are classified into categories:

• Novel genes: 21
• Updated genes: 812
• Splice-variants: 2,134
• B-list: 1,586
• Rejects: 2,318

Page 32: Making best use of TAIR tools and datasets

Where can you find gene structure data on TAIR?

• ON GENE MODEL PAGE
  o Graphic of exon-intron structure
  o Coordinates of each exon
• ON GBROWSE
  o Graphic display of structure and overlapping evidence data
• ON FTP SITE
  o GFF files with exact structures of each gene model
  o Files with gene confidence ranking information
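
For the GFF files on the FTP site, exon coordinates per gene model can be pulled out with a few lines of Python. The sketch below assumes the standard nine-column, tab-separated GFF3 layout with exons linked to their gene model through a `Parent` attribute; the sample lines and the IDs `MODEL.1` and `MODEL.2` are made up for illustration and are not copied from a TAIR release.

```python
from collections import defaultdict


def exons_by_model(gff_lines):
    """Map each gene model ID to a list of (seqid, start, end, strand) exons sorted by start."""
    exons = defaultdict(list)
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        seqid, _source, _type, start, end, _score, strand, _phase, attributes = fields
        attrs = dict(item.split("=", 1) for item in attributes.split(";") if "=" in item)
        # An exon shared by several splice-variants lists several parents.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((seqid, int(start), int(end), strand))
    return {model: sorted(coords, key=lambda e: e[1]) for model, coords in exons.items()}


if __name__ == "__main__":
    sample = [
        "##gff-version 3",
        "Chr1\tExample\tmRNA\t100\t900\t.\t+\t.\tID=MODEL.1;Parent=GENE",
        "Chr1\tExample\texon\t500\t900\t.\t+\t.\tParent=MODEL.1",
        "Chr1\tExample\texon\t100\t300\t.\t+\t.\tParent=MODEL.1,MODEL.2",
    ]
    for model, coords in exons_by_model(sample).items():
        print(model, coords)
```

The function accepts any iterable of lines, so an open file handle for a downloaded GFF file can be passed in directly.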

Page 33: Making best use of TAIR tools and datasets

Gene Locus Page

Page 34: Making best use of TAIR tools and datasets

Gene Model Page

Page 36: Making best use of TAIR tools and datasets

GBrowse

Page 37: Making best use of TAIR tools and datasets

GBrowse

Header

Main Browser Window

Track Menu

Page 39: Making best use of TAIR tools and datasets

FTP site

Page 40: Making best use of TAIR tools and datasets

FTP site

Page 41: Making best use of TAIR tools and datasets

FTP site

Page 43: Making best use of TAIR tools and datasets

Gene Confidence Rank

• Attributes confidence scores to all exons and gene models based on different types of experimental and computational evidence

Page 44: Making best use of TAIR tools and datasets

Assigning A Confidence Rank

E1

E4

Page 45: Making best use of TAIR tools and datasets

Full support

No support

Page 46: Making best use of TAIR tools and datasets

New Tools at TAIR

• N-Browse
• GBrowse
• Synteny viewer

Page 47: Making best use of TAIR tools and datasets

New Tools at TAIR

• N-Browse (in collaboration with the Kris Gunsalus Lab, NYU)
• GBrowse
• Synteny viewer

Page 48: Making best use of TAIR tools and datasets

N-Browse

Page 49: Making best use of TAIR tools and datasets

N-Browse: Finding information about edges (interactions)

Page 50: Making best use of TAIR tools and datasets

N-Browse: How to select and move nodes

Page 51: Making best use of TAIR tools and datasets

N-Browse: How to visualize GO terms from a selected set of nodes

Page 52: Making best use of TAIR tools and datasets

N-Browse: How to load your own file and overlay it with the curated interaction data

Page 53: Making best use of TAIR tools and datasets

N-Browse: How to save your session and export your data

Page 55: Making best use of TAIR tools and datasets

GBrowse

Header

Main Browser Window

Track Menu

Page 56: Making best use of TAIR tools and datasets

Alternative gene annotations
• Eugene (transcript, proteins +) Sebastien Aubourg
• Gnomon (transcript, proteins) Souvorov (NCBI)
• Aceview (transcript) Thierry-Mieg (NCBI)
• Hanada et al 2007 (3633 predicted genes)

Identify possible corrections

Page 57: Making best use of TAIR tools and datasets

Proteomic Data
• High-density Arabidopsis proteome map (Baerenfaller et al. 2008)

Incorrect start codon

Page 58: Making best use of TAIR tools and datasets

VISTA plot GBrowse track

Page 59: Making best use of TAIR tools and datasets

Transcriptome data

Page 60: Making best use of TAIR tools and datasets

Orthologs and Gene Families

Page 61: Making best use of TAIR tools and datasets

Variation

Page 62: Making best use of TAIR tools and datasets

Promoter Elements

Page 63: Making best use of TAIR tools and datasets

Methylation

Page 64: Making best use of TAIR tools and datasets

Decorated Fasta file

Page 65: Making best use of TAIR tools and datasets

Decorated Fasta file

Page 66: Making best use of TAIR tools and datasets

Decorated Fasta file

Page 67: Making best use of TAIR tools and datasets

New Tools at TAIR

• N-Browse
• GBrowse
• Synteny viewer

Data provided by Pedro Pattyn at the University of Ghent

Page 68: Making best use of TAIR tools and datasets

AT5G48000

AT5G48010

AT5G47990

Page 70: Making best use of TAIR tools and datasets

www.arabidopsis.org

[email protected]

www.arabidopsis.org/biocyc

[email protected]

www.plantcyc.org

[email protected]

Page 71: Making best use of TAIR tools and datasets

Acknowledgements

Curators:

- Peifen Zhang

- Tanya Berardini

- David Swarbreck

- Kate Dreher

- Rajkumar Sasidharan

Tech Team:
- Bob Muller
- Larry Ploetz
- Raymond Chetty
- Anjo Chi
- Vanessa Kirkup
- Cynthia Lee
- Tom Meyer
- Shanker Singh
- Chris Wilks

AraCyc and TAIR

PI and Co-PI
- Eva Huala
- Sue Rhee

Metabolic Pathway Software:
- Peter Karp and SRI group

Page 82: Making best use of TAIR tools and datasets

Automated pipeline at TAIR
Program to Assemble Spliced Alignments (PASA)

Clustered transcripts

Resulting gene model

Previous gene model

Based on a set of rules a decision is made

comparison

NCBI

Page 83: Making best use of TAIR tools and datasets

Gene structure annotation in Arabidopsis

NEW: 282 genes; 1056 exons
UPDATED: 1254 models; 1144 exons

NEW: 1291 genes; 683 exons
UPDATED: 3811 models; 4007 exons

NEW: 681 genes; 828 exons
UPDATED: 10,792 models and 14,050 exons

TAIR6

Page 84: Making best use of TAIR tools and datasets

How do MOD curators annotate genomes?

Experimental & Computational Evidence

Automatic pipeline

Manual annotation

Genome annotation

Page 85: Making best use of TAIR tools and datasets

How do MOD curators annotate genomes?

Experimental & Computational Evidence

Automatic pipeline

Manual annotation

Genome annotation

ESTs cDNAs

Page 88: Making best use of TAIR tools and datasets

How do MOD curators annotate genomes?

Experimental & Computational Evidence

Automatic pipeline

Manual annotation

Genome annotation

• Alternative gene models
• Short MS peptides
• Community submissions…

Page 89: Making best use of TAIR tools and datasets

Manual annotation at different MODs

Genome editing tool + Evidence set + Set of annotation rules

Page 90: Making best use of TAIR tools and datasets

Manual annotation at different MODs

Genome editing tool + Evidence set + Set of annotation rules

Genome editing tool:
• Apollo (Arabidopsis, Fly)
• Aceview (Worm)
• Zmap/Otterlace (Human)
• Artemis (Pathogen Project)
• …

Evidence set:
• Nucleotide sequence
• Short peptides
• Protein similarity
• Alternative predictions
• …

Set of annotation rules:
• Exon size
• Intron size
• Number of UTRs
• Coding/Non-coding ratio
• Splice-junctions
• …

Page 91: Making best use of TAIR tools and datasets

Responsibilities of a gene structure curator

ATG TGAGT GTAG AG

Delete wrongly predicted genes

Page 92: Making best use of TAIR tools and datasets

Responsibilities of a gene structure curator

ATG TGAGT GTAG AG

cDNA

Update mispredicted exon-intron structure

Page 94: Making best use of TAIR tools and datasets

Responsibilities of a gene structure curator

ATG TGAGT GTAG AG

Annotate splice-variants

ATG TGAGT AG

Page 95: Making best use of TAIR tools and datasets

Responsibilities of a gene structure curator

Annotate ‘atypical’ gene classes

• Short protein-coding genes
• Transposable element genes
• Pseudogenes
• uORFs (genes within UTR of other genes)

Page 96: Making best use of TAIR tools and datasets

Responsibilities of a gene structure curator

ATG TGAGT GTAG AG

Define gene type

Protein-coding, tRNA, snRNA, snoRNA, rRNA, …

Page 97: Making best use of TAIR tools and datasets

Categorisation/Review
• 17,915 total gene models
• Categorise/Prioritise (CDS length, Blast similarity, gene confidence rank)

TAU Models

RNA-seq Junctions

Augustus Model

TAIR confidence rank

TAIR Model

Peptides

(Splice variants, NMD targets)

(correction)

(colour reflects matching model)

Incorrect junction in TAIR model

Unsupported exon

Page 98: Making best use of TAIR tools and datasets

Augustus

• RNA-seq Junctions = cluster reads

• Augustus Input: RNA-seq junctions, peptides, ESTs/cDNAs, TAIR models

• Provide evidence ranking and bonus scores

Junction assembly

Raw spliced RNA-seq reads (8,819,162 reads)

(203,317 Junctions)

Page 99: Making best use of TAIR tools and datasets

Examples of large-scale community datasets recently integrated into the Arabidopsis annotation
• Transposable elements (Quesneville Lab)
• Pseudogenes (Gerstein Lab)
• Short MS peptides (Baerenfaller et al, Castellana et al)
• Short genes (Hanada et al)

Page 100: Making best use of TAIR tools and datasets

Model Organism Databases

Page 101: Making best use of TAIR tools and datasets

Augustus - Results

Augustus models were classified into 4 categories:

• Novel genes: 20
• Updated genes: 897
• Splice-variants: 1,826
• B-list: 1,173
• Rejects: 3,137

Page 102: Making best use of TAIR tools and datasets

Arabidopsis gene structure annotation A new approach

TAIR6-TAIR9: ESTs and cDNAs serve as main source of experimental data used for genome annotation

cDNAs & ESTs

Automated annotation

Annotated Arabidopsis genome

PASA: Program to Assemble Spliced Alignments

Page 103: Making best use of TAIR tools and datasets

Arabidopsis gene structure annotation A new approach

TAIR6-TAIR9: ESTs and cDNAs serve as main source of experimental data used for genome annotation

cDNAs & ESTs

Automated annotation

Manual annotation

Annotated Arabidopsis genome

PASA: Program to Assemble Spliced Alignments

Page 104: Making best use of TAIR tools and datasets

Arabidopsis gene structure annotation A new approach

TAIR6-TAIR9: ESTs and cDNAs serve as main source of experimental data used for Arabidopsis genome annotation

cDNAs & ESTs

Automated annotation

Annotated Arabidopsis genome

PASA: Program to Assemble Spliced Alignments

Page 106: Making best use of TAIR tools and datasets

Arabidopsis gene structure annotation A new approach

TAIR6-TAIR9: ESTs and cDNAs serve as main source of experimental data used for genome annotation

cDNAs & ESTs

Automated annotation

Manual annotation

Annotated Arabidopsis genome

MS peptides
RNA-seq data

PASA: Program to Assemble Spliced Alignments

