

Experiences and suggestions for the annotation of tomato BAC clones

2005-09-28

Dr. Cheol-Goo Hur

Plant Genome Lab.

Genome Research Center

KRIBB, Korea

Contents

• Phase-I Annotation

• Define gene structures

• Sample Annotations

• Future Works

• Acknowledgements

Phase-I Annotation

| Target | Analysis | Tools / Data: SGN Guideline*1 | Tools / Data: KRIBB |
|---|---|---|---|
| Protein Coding Genes | Computational Gene Prediction | GeneMark.hmm, FGENESH, GlimmerM, GENSCAN+, EuGene | FGENESH (N. tabacum) |
| | Experimental Gene Identification | GeneSeqer, SIM4, BLAST (Tomato cDNAs, ESTs, unigenes) | BLAT, SIM4, GMAP, GeneSeqer (dbEST, GenBank mRNAs), GeneWise (GenPept Proteins) |
| | Resolution of Conflict | PASA, GeneSeqer (Automatic); Apollo Genome Viewer (Manual) | Combined Modeller (Automatic)*2; Apollo Genome Viewer (Manual) |
| tRNA | Computational tRNA Prediction | tRNAscan-SE | tRNAscan-SE |
| Other RNAs | Similarity-based RNA Identification (microRNAs, snoRNAs) | - | Cross-match (GenBank rRNA, Rfam) |
| Repeats | Repeat Scanning | - | RepeatMasker/Cross-match (RepBase/TIGR Plant Repeats) |

*1. Version 0.9, March 31, 2005

*2. KRIBB

Functional Annotation

| Target | Analysis | Tools / Data: SGN Guideline | Tools / Data: KRIBB |
|---|---|---|---|
| Function of Protein Coding Genes | Conserved Functional Domains*1 | InterProScan (InterPro Databases) | InterProScan (InterPro Databases) |
| | Homology to Proteins*1,2 | BLASTx (Arabidopsis, rice, Medicago, Swiss-Prot, GenBank nr) | BLASTx (UniProt) |
| | Gene Ontology assignment | - | BLASTx (Arabidopsis Proteins associated with GOA, TAIR GO data)*3 |
| | EC/Pathway | - | BLASTx (Arabidopsis*3 Proteins associated with KEGG EC/Pathway data) |
| Protein Location Predictions | | Transmembrane Domains (TMHMM), Subcellular Location (TargetP) | Transmembrane Domains (TMHMM), Subcellular Location (TargetP) |

*1. Automated annotation should be converted to GO codes for easier comparison.

*2. Classify into 5 classes of gene annotation based on sequence similarity and availability of expression data (known / putative / similar to / expressed / no evidence).

*3. Use the full Arabidopsis protein set to maximize the number of GO-assigned genes.

Data Set for gene structure and annotation (Aug. 2005)

• BACs
– Sequenced: 29 (4 BACs overlapping in 2 pairs)
– Annotated: 22
• ESTs: 200 015 (cf. Potato 193 233, Pepper 115 598)
• Full-length mRNAs (GenBank): 596
• Full-length Proteins (UniProt 5.1): 1 044
• Protein DB (UniProt Release 5.1)
– Swiss-PROT/TREMBL: 181 821 / 1 748 002
• Arabidopsis Proteins
– GO associated (TAIR): 26 196
– Pathway/EC associated (KEGG): 1 520

Defining the Gene Structure

• New Genomes, New Challenges: lack of data

• To get the best performance from the available data, a well-combined method is needed

– Combine experimental data-based gene models

– Extend the gene boundary and make up for the missing parts with predicted gene models

– Final manual curation

• Example: EuGene for Medicago genome annotation

Structure of Protein Coding Genes

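[Figure: structure of a protein-coding gene. Labels: transcripts (alternatively spliced forms, ESTs), P, CpG, TSS, TIS (ATG, Met), intron splicing signal (GT---AG), CDS, stop codons (TAA, TAG, TGA), poly-A signal (AATAAA), poly-A site]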

1. Define gene structure from various types of evidence

• Full-length evidenced genes (mRNAs / Proteins)

• Full-length clue evidenced genes (Full-length clue ESTs from Kazusa full-length cDNA library)

• Partially evidenced genes (Other partial ESTs)

• No-evidenced genes (Prediction only)
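The four classes above form a priority order: the strongest available evidence decides the class of a locus. A minimal Python sketch of that ordering follows. The `LocusEvidence` record and its field names are hypothetical, and the "at least one" thresholds are placeholders, since the slides give a separate requirement for each class.

```python
from dataclasses import dataclass


@dataclass
class LocusEvidence:
    """Evidence counts for one gene locus (hypothetical record, not from the slides)."""
    full_length_mrnas: int = 0
    full_length_proteins: int = 0
    full_length_clue_ests: int = 0
    partial_ests: int = 0


def evidence_class(ev: LocusEvidence) -> str:
    """Return the strongest evidence class supported by the counts.

    Assumption: any non-zero count qualifies; the real per-class
    requirements would replace these simple checks.
    """
    if ev.full_length_mrnas > 0 or ev.full_length_proteins > 0:
        return "full-length evidenced"
    if ev.full_length_clue_ests > 0:
        return "full-length clue evidenced"
    if ev.partial_ests > 0:
        return "partially evidenced"
    return "no-evidenced"
```

For example, a locus with one full-length protein and five partial ESTs falls into the full-length evidenced class, because the checks run from strongest to weakest evidence.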

[Figure: predicted gene model shown against mRNA/protein evidence and against EST evidence]

1) Full-length Evidenced Genes

• Gene locus with full-length mRNA / protein (GMAP, GeneWise)
• Almost complete gene structure: gene boundary (mRNA: TSS/poly-A, protein: CDS), exon/intron, (some alternative splicing structure)
• Requirement: more than 1 mRNA or protein
• Processing:
– Merge the same AS forms
– mRNA evidence: predict CDS (ESTScan etc.)
– Protein evidence: mend gene boundary (TSS, poly-A)
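One way to read "merge the same AS forms" is to group aligned transcripts that share exactly the same introns and keep one model per group. The Python sketch below does that under stated assumptions: all transcripts come from one locus and one strand, each transcript is a list of `(start, end)` exon coordinates, and only identical intron chains are merged. It is an illustration, not the Combined Modeller implementation.

```python
from collections import defaultdict


def intron_chain(exons):
    """Return the introns of a transcript as (exon_end, next_exon_start) pairs."""
    exons = sorted(exons)
    return tuple((a_end, b_start)
                 for (_, a_end), (b_start, _) in zip(exons, exons[1:]))


def merge_same_forms(transcripts):
    """Merge transcripts with identical intron chains into one model each.

    The merged model keeps the shared introns and takes the outermost
    start and end seen in the group as its boundaries.
    """
    groups = defaultdict(list)
    for exons in transcripts:
        groups[intron_chain(exons)].append(sorted(exons))

    merged = []
    for chain, members in groups.items():
        start = min(m[0][0] for m in members)
        end = max(m[-1][1] for m in members)
        # Flatten to [start, e1_end, e2_start, ..., end] and pair up again.
        bounds = [start] + [pos for intron in chain for pos in intron] + [end]
        merged.append([(bounds[i], bounds[i + 1])
                       for i in range(0, len(bounds), 2)])
    return sorted(merged)
```

With this strict grouping, a partial EST that covers only some of the introns stays a separate model; folding such fragments into a longer form would need a compatibility test rather than exact equality.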

[Figure: mRNA and protein evidence aligned with the predicted model]

2) Full-length Clue Evidenced Genes

• Gene locus with full-length clue ESTs from Kazusa full-length cDNA library (GMAP)
• Gene boundary (TSS, poly-A), some exon/intron
• Requirement: more than 1 full-length clue EST
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTScan / OrfPredictor etc.)

[Figure: EST evidence aligned with the predicted model]

3) Partially Evidenced Genes

• Gene locus with general ESTs (GMAP)
• Some exon/intron, poly-A
• More ESTs, more information expected
• Requirement: more than 2 ESTs with more than 2 couples of overlapped hard-edges
• Processing:
– Merge the same AS forms
– Link the same-cloned ESTs
– Mend incomplete portion with predicted model
– CDS to be predicted (ESTScan / OrfPredictor etc.)

[Figure: two ESTs (EST1, EST2) aligned with the predicted model]

4) No-evidenced Genes

• Predicted model only (hypothetical gene)

• Predicted CDS

[Figure: predicted model only]

2. Transcript-Genome mappers

• Tested BLAT / SIM4 / GMAP / GeneSeqer
– BLAT: fast but inaccurate
– SIM4 / GMAP / GeneSeqer: approximately the same results
• KRIBB: prefilter ESTs with BLAT, then map with GMAP
– Cutoff: Coverage > 80%, Identity > 92%

Problem of repeats and similarity? Or misassembly?

Similarity cutoff needed
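The coverage and identity cutoff for EST mapping can be sketched as a simple filter. The slides do not define how coverage and identity are computed, so the definitions below (aligned query bases over query length, and matching bases over aligned bases) are assumptions, as is the hypothetical `Alignment` record.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """One transcript-to-genome alignment (hypothetical record)."""
    query_id: str
    query_length: int   # length of the EST / mRNA in bases
    aligned_bases: int  # query bases included in the alignment
    matches: int        # aligned bases identical to the genome


def coverage(a: Alignment) -> float:
    """Fraction of the query that is aligned (assumed definition)."""
    return a.aligned_bases / a.query_length


def identity(a: Alignment) -> float:
    """Fraction of aligned bases that match (assumed definition)."""
    return a.matches / a.aligned_bases


def passes_cutoff(a: Alignment, min_cov: float = 0.80,
                  min_ident: float = 0.92) -> bool:
    """Apply the EST cutoff: coverage > 80% and identity > 92%."""
    if a.query_length <= 0 or a.aligned_bases <= 0:
        return False
    return coverage(a) > min_cov and identity(a) > min_ident


def filter_alignments(alignments):
    """Keep only the alignments that pass the cutoff."""
    return [a for a in alignments if passes_cutoff(a)]
```

Keeping the thresholds as parameters lets the same function be reused with a different cutoff for another evidence type.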

3. Protein-based Gene Models

• GeneWise / FGENESH+

• KRIBB: GeneWise after prefiltering proteins by BLASTx
– BLASTx cutoff: Coverage > 80%, Identity > 80%

Sample Annotations: define gene structure and annotation

1) Full-length Evidenced Gene: C02HBa0025N15.220

• mRNA/Protein evidence

• Annotation

– Product: SNF1 [Lycopersicon esculentum]

– IPR000719 Prot_kinase

– GO:0006468(P) protein amino acid phosphorylation

– GO:0004672(F) protein kinase activity

– EC:2.7.1.-: Snf1-related protein kinase (KIN10) (SKIN10)

– TMHMM: outside

2) Full-length Evidenced Gene: C02HBa0066C13.60

• Protein evidence
• Annotation
– Product: phytochrome E [Lycopersicon esculentum]
– IPR001294 Phytochrome
– GO:0006355(P) regulation of transcription, DNA-dependent
– GO:0008020(F) G-protein coupled photoreceptor activity
– TargetP/TMHMM: C/outside
– FunCat: 30.01 intracellular signalling; 70.01 cell wall

3) Full-length Clue Evidenced Gene: C02HBa0060J03.170

• Kazusa full-length cDNA/EST evidence
• Annotation
– Product: putative protein [Arabidopsis thaliana]
– IPR001251: CRAL_bd_TRIO_C
– TMHMM: outside

[Figure: gene model, ~1 kb, 3 exons]

4) Partially Evidenced Gene: C02HBa0060J03.90

• EST evidence
• Annotation
– Product: putative protein [Arabidopsis thaliana]
– IPR000719 Prot_kinase
– GO:0006468(P) protein amino acid phosphorylation
– GO:0004672(F) protein kinase activity
– GO:0016020(C) membrane
– TMHMM: outside

5) Gene with alternative splicing: C02HBa0060J03.40-4

• EST evidence
• Annotation
– Product: transformer-SR ribonucleoprotein [N. tabacum]
– IPR000504 RNA-binding region RNP-1
– GO:0003676(F) nucleic acid binding
– GO:0030529(C) ribonucleoprotein complex
– TargetP/TMHMM: C/outside

Annotation Results

| Property | Value | Unit |
|---|---|---|
| BAC (Annotated/Sequenced) | 22 / 29 | BAC |
| Length (Average/Total) | 122 / 2698 | kb |
| Putative Protein CDSs | 620 | gene |
| Gene Density | 4.6 | kb/gene |
| Gene Length, Average | 3.1 | kb |
| Exon Length, Average | 272 | bp |
| Exons per Gene, Average | 7.3 | exon/gene |
| With ESTs | 352 (57%) | gene |
| Protein Annotated | 446 (72%) | gene |
| Domain Annotated | 424 (68%) | gene |
| GO Annotated | 338 (55%) | gene |
| Pathway Annotated | 25 (4%) | gene |
| EC Annotated | 29 (5%) | gene |
| tRNA | 13 | gene |
| Repeats | 144 (5.3%) | kb |

*1. All values are from the 22 annotated BACs.
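The percentages reported for the gene counts appear to be fractions of the 620 putative protein CDSs, and the repeat percentage a fraction of the 2698 kb total length. The short Python check below recomputes them from the counts in the results; the dictionary keys simply reuse the property names, and the choice of denominators is an assumption.

```python
TOTAL_GENES = 620       # putative protein CDSs
TOTAL_LENGTH_KB = 2698  # total length of the annotated BACs

annotated_counts = {
    "With ESTs": 352,
    "Protein Annotated": 446,
    "Domain Annotated": 424,
    "GO Annotated": 338,
    "Pathway Annotated": 25,
    "EC Annotated": 29,
}


def percent(count: int, total: int, digits: int = 0) -> float:
    """Return count as a percentage of total, rounded to `digits` decimals."""
    return round(100 * count / total, digits)


gene_percentages = {name: percent(n, TOTAL_GENES)
                    for name, n in annotated_counts.items()}
repeat_percentage = percent(144, TOTAL_LENGTH_KB, digits=1)

for name, pct in gene_percentages.items():
    print(f"{name}: {pct:.0f}%")
print(f"Repeats: {repeat_percentage}%")
```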

Future Works

• Training data set for Tomato gene HMM models
• Automation
• Performance assessment
• Manual curation (Apollo)

| Tool | Author | Source |
|---|---|---|
| BLAT | Jim Kent | UCSC (http://www.cse.ucsc.edu/~kent/) |
| FGENESH | Solovyev, et al. | SoftBerry, Inc. (http://www.softberry.com/) |
| GMAP | Thomas D. Wu, Colin K. Watanabe | Genentech, Inc. (http://www.gene.com/share/gmap/) |
| GeneSeqer | V. Brendel, et al. | Iowa State/Stanford University (http://bioinformatics.iastate.edu/bioinformatics2go/gs/download.html) |
| GeneWise | Ewan Birney | EBI (http://www.ebi.ac.uk/Wise2/) |
| InterProScan | | EBI (http://www.ebi.ac.uk/InterProScan/) |
| Miropeats | Parsons J.D. | Washington University (http://genomeold.wustl.edu/groups/informatics/software/miropeats/) |
| BLAST (NCBI) | S.F. Altschul | NCBI (http://www.ncbi.nlm.nih.gov/blast/) |
| Phred/phrap/cross_match | Phil Green | University of Washington (http://www.phrap.org/) |
| RepeatMasker | Arian Smit, P. Green | (http://www.repeatmasker.org/) |
| SIM4 | Liliana Florea et al. | PennState University (http://globin.cse.psu.edu/) |
| TargetP | Olof Emanuelsson, et al. | CBS in Technical University of Denmark (http://www.cbs.dtu.dk/services/TargetP/) |
| TMHMM | A. Krogh, et al. | CBS in Technical University of Denmark (http://www.cbs.dtu.dk/services/TMHMM/) |
| tRNAscan-SE | T.M. Lowe, S.R. Eddy | Washington University (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/) |

Acknowledgement

http://sol.kribb.re.kr

Thank you for your attention!