Experiences and suggestions for the annotation of tomato BAC clones

2005-09-28

Dr. Cheol-Goo Hur

Plant Genome Lab.

Genome Research Center

KRIBB, Korea


Contents

• Phase-I Annotation

• Define gene structures

• Sample Annotations

• Future Works

• Acknowledgements


Phase-I Annotation

| Target | Analysis | Tools / Data: SGN Guideline*1 | Tools / Data: KRIBB |
|---|---|---|---|
| Protein Coding Genes | Computational Gene Prediction | GeneMark.hmm, FGENESH, GlimmerM, GENSCAN+, EuGene | FGENESH (N. tabacum) |
| | Experimental Gene Identification | GeneSeqer, SIM4, BLAST (Tomato cDNAs, ESTs, unigenes) | BLAT, SIM4, GMAP, GeneSeqer (dbEST, GenBank mRNAs), GeneWise (GenPept Proteins) |
| | Resolution of Conflict | PASA, GeneSeqer (Automatic); Apollo Genome Viewer (Manual) | Combined Modeller (Automatic)*2; Apollo Genome Viewer (Manual) |
| tRNA | Computational tRNA Prediction | tRNAscan-SE | tRNAscan-SE |
| Other RNAs | Similarity-based RNA Identification (microRNAs, snoRNAs) | - | Cross-match (GenBank rRNA, Rfam) |
| Repeats | Repeat Scanning | - | RepeatMasker/Cross-match (RepBase/TIGR Plant Repeats) |

*1. Version 0.9 March 31, 2005

*2. KRIBB


Functional Annotation

| Target | Analysis | Tools / Data: SGN Guideline | Tools / Data: KRIBB |
|---|---|---|---|
| Function of Protein Coding Genes | Conserved Functional Domains*1 | InterProScan (InterPro Databases) | InterProScan (InterPro Databases) |
| | Homology to Proteins*1,2 | BLASTx (Arabidopsis, rice, Medicago, Swiss-Prot, GenBank nr) | BLASTx (UniProt) |
| | Gene Ontology assignment | - | BLASTx (Arabidopsis Proteins associated with GOA, TAIR GO data)*3 |
| | EC/Pathway | - | BLASTx (Arabidopsis*3 Proteins associated with KEGG EC/Pathway data) |
| | Protein Location Predictions | Transmembrane Domains (TMHMM), Subcellular Location (TargetP) | Transmembrane Domains (TMHMM), Subcellular Location (TargetP) |

*1. Automated annotation should be converted to GO code for easier comparisons.

*2. Classify into 5 classes of gene annotation based on sequence similarity and availability of expression data (known / putative / similar to / expressed / no evidence).

*3. Use the Arabidopsis full protein set to maximize the number of GO-assigned genes.
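The GO assignment row above relies on BLASTx hits against Arabidopsis proteins that carry GO associations. A minimal sketch of one way such a transfer could work is shown here. The slides do not describe the actual rule, so the "best-scoring hit that has GO terms" policy, the function name `transfer_go`, and the input shapes are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical input record: (tomato_gene_id, arabidopsis_protein_id, bit_score)
Hit = Tuple[str, str, float]


def transfer_go(hits: List[Hit], go_map: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Give each gene the GO terms of its best-scoring hit that has GO terms.

    Assumption: only the single best GO-associated hit is used; the slides
    do not state how hits were selected.
    """
    best: Dict[str, Tuple[float, str]] = {}
    for gene, protein, score in hits:
        if protein not in go_map:
            continue  # this hit has no GO association, nothing to transfer
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, protein)
    return {gene: list(go_map[protein]) for gene, (_, protein) in best.items()}


# Placeholder identifiers, not real accessions.
example_hits = [("gene1", "protA", 250.0), ("gene1", "protB", 300.0), ("gene2", "protC", 90.0)]
example_go = {"protA": ["GO:0006468", "GO:0004672"]}
print(transfer_go(example_hits, example_go))
```

Genes whose hits have no GO association are simply left out of the result, which is why a larger GO-associated protein set would raise the number of assigned genes.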


Data Set for gene structure and annotation (Aug. 2005)

• BACs

– Sequenced: 29 (4 BACs overlapping in 2 pairs)

– Annotated: 22

• ESTs: 200 015 (cf. Potato 193 233, Pepper 115 598)

• Full-length mRNAs (GenBank): 596

• Full-length Proteins (UniProt 5.1): 1 044

• Protein DB (UniProt Release 5.1)

– Swiss-Prot/TrEMBL: 181 821 / 1 748 002

• Arabidopsis Proteins

– GO associated (TAIR): 26 196

– Pathway/EC associated (KEGG): 1 520


Defining the Gene Structure

• New Genomes, New Challenges: lack of data

• To get the best performance from the given data, a well-combined method is needed

– Combine experimental data-based gene models

– Extend the gene boundary and make up for the missing parts with predicted gene models

– Final manual curation

• Example: EuGene for Medicago genome annotation
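The "extend the gene boundary with predicted gene models" step can be pictured with a small sketch. This is only an illustration under stated assumptions: exons are plain `(start, end)` tuples, the function name `extend_with_prediction` is hypothetical, and the rule used (keep every evidence exon, add only predicted exons that lie entirely outside the evidence span) is a simplification, not the Combined Modeller's actual logic.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive coordinates on the BAC


def extend_with_prediction(evidence: List[Exon], predicted: List[Exon]) -> List[Exon]:
    """Keep evidence exons and add predicted exons flanking the evidence span.

    Assumption: predicted exons overlapping the evidence span are ignored,
    so experimental data always wins inside the region it covers.
    """
    if not evidence:
        return sorted(predicted)  # no experimental data: fall back to the prediction
    ev_start = min(start for start, _ in evidence)
    ev_end = max(end for _, end in evidence)
    flanking = [(s, e) for s, e in predicted if e < ev_start or s > ev_end]
    return sorted(evidence + flanking)


est_exons = [(500, 700), (900, 1100)]
predicted_exons = [(100, 250), (480, 700), (900, 1100), (1300, 1500)]
print(extend_with_prediction(est_exons, predicted_exons))
```

A model built this way would still need the manual curation step listed above, since the sketch does not check reading frame or splice-site compatibility.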


Structure of Protein Coding Genes

[Figure: schematic of a protein-coding gene. Labels: P, CpG, TSS, TIS at ATG (Met), CDS, intron splicing signal (GT---AG), Stop (TAA, TAG, TGA), AATAAA, Poly-A Site, Transcripts (alternative spliced forms; ESTs)]


1. Define gene structure from various kinds of evidence

• Full-length evidenced genes (mRNAs / Proteins)

• Full-length clue evidenced genes (Full-length clue ESTs from Kazusa full-length cDNA library)

• Partially evidenced genes (Other partial ESTs)

• No-evidenced genes (Prediction only)

[Figure: example loci; labels: Predict, mRNA, Protein (first panel) and Predict, EST (second panel)]
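The four evidence classes listed on this slide form a priority order, which can be sketched as a simple classifier. The record type `LocusEvidence` and the function `evidence_class` are hypothetical names, and the sketch treats any supporting sequence as sufficient; it does not apply the per-class requirements given on the slides that follow.

```python
from dataclasses import dataclass


@dataclass
class LocusEvidence:
    """Counts of supporting sequences mapped to one gene locus (hypothetical record)."""
    full_length_mrnas: int = 0
    full_length_proteins: int = 0
    full_length_clue_ests: int = 0  # ESTs from the Kazusa full-length cDNA library
    other_ests: int = 0


def evidence_class(ev: LocusEvidence) -> str:
    """Return the strongest evidence class that applies to the locus.

    Assumption: one supporting sequence is enough to enter a class.
    """
    if ev.full_length_mrnas + ev.full_length_proteins > 0:
        return "full-length evidenced"
    if ev.full_length_clue_ests > 0:
        return "full-length clue evidenced"
    if ev.other_ests > 0:
        return "partially evidenced"
    return "no-evidenced"


print(evidence_class(LocusEvidence(full_length_clue_ests=2, other_ests=5)))
```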


1) Full-length Evidenced Genes

• Gene locus with full-length mRNA / Protein (GMAP, GeneWise)

• Almost complete gene structure: Gene boundary (mRNA: TSS/poly-A, protein: CDS), Exon/Intron, (some alternative splicing structure)

• Requirement: more than 1 mRNA or Protein

• Processing:

– Merge the same AS forms

– mRNA evidence: Predict CDS (ESTscan etc.)

– Protein evidence: Mend gene boundary (TSS, poly-A)

[Figure: example locus; labels: mRNA, Protein, Predict]


2) Full-length Clue Evidenced Genes

• Gene locus with full-length clue ESTs from Kazusa full-length cDNA library (GMAP)

• Gene boundary (TSS, poly-A), some Exon/Intron

• Requirement: more than 1 full-length clue EST

• Processing:

– Merge the same AS forms

– Link ESTs from the same clone

– Mend incomplete portion with predicted model

– CDS to be predicted (ESTscan / orfPredictor etc.)

[Figure: example locus; labels: EST, Predict]


3) Partially Evidenced Genes

• Gene locus with general ESTs (GMAP)

• Some Exon/Intron, poly-A

• More ESTs, more information expected

• Requirement: more than 2 ESTs with more than 2 couples of overlapped hard-edges

• Processing:

– Merge the same AS forms

– Link ESTs from the same clone

– Mend incomplete portion with predicted model

– CDS to be predicted (ESTscan / orfPredictor etc.)

[Figure: example locus; labels: EST1, Predict, EST2]
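The "merge the same AS forms" step that recurs in the processing lists above could be sketched by grouping mapped transcripts on their intron chain. This is an interpretation, not the documented procedure: treating two alignments as the same alternative-splicing form when their intron coordinates match exactly is an assumption, and the names `intron_chain` and `merge_same_forms` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Exon = Tuple[int, int]                # (start, end), 1-based inclusive
Chain = Tuple[Tuple[int, int], ...]   # ordered intron coordinates


def intron_chain(exons: List[Exon]) -> Chain:
    """Return the introns implied by consecutive exons, as (start, end) pairs."""
    ordered = sorted(exons)
    return tuple(
        (ordered[i][1] + 1, ordered[i + 1][0] - 1) for i in range(len(ordered) - 1)
    )


def merge_same_forms(alignments: Dict[str, List[Exon]]) -> Dict[Chain, List[str]]:
    """Group transcript IDs whose alignments share an identical intron chain."""
    groups: Dict[Chain, List[str]] = defaultdict(list)
    for name, exons in alignments.items():
        groups[intron_chain(exons)].append(name)
    return dict(groups)


mapped = {
    "est1": [(100, 200), (300, 400)],
    "est2": [(150, 200), (300, 380)],  # same intron as est1, different ends
    "est3": [(100, 200), (320, 400)],  # alternative acceptor site
}
for chain, members in merge_same_forms(mapped).items():
    print(chain, members)
```

Exact-chain matching is deliberately strict: a partial EST that covers only some introns of a longer form ends up in its own group, and all single-exon alignments share the empty chain, so a real pipeline would need a compatibility test rather than equality.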


4) No-evidenced Genes

• Predicted model only (hypothetical gene)

• Predicted CDS

[Figure: example locus; label: Predict]


2. Transcript-Genome mappers

• Test BLAT/SIM4/GMAP/GeneSeqer

– BLAT: fast but inaccurate

– SIM4/GMAP/GeneSeqer: approximately the same results

• KRIBB: prefiltering ESTs by BLAT, then mapping with GMAP

• Cutoff: Coverage > 80%, Identity > 92%
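The EST cutoff on this slide (coverage above 80%, identity above 92%) can be expressed as a small filter. The slides do not define either measure, so this sketch assumes coverage is the aligned fraction of the EST and identity is the fraction of aligned bases that match the genome. The `Alignment` record and the function `passes_cutoff` are hypothetical names.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    est_id: str
    est_length: int     # total EST length in bases
    aligned_bases: int  # EST bases included in the alignment
    matches: int        # aligned bases identical to the genome


def passes_cutoff(aln: Alignment, min_coverage: float = 0.80,
                  min_identity: float = 0.92) -> bool:
    """Apply the coverage and identity thresholds (both strict, as on the slide)."""
    if aln.est_length <= 0 or aln.aligned_bases <= 0:
        return False  # malformed record, cannot compute ratios
    coverage = aln.aligned_bases / aln.est_length
    identity = aln.matches / aln.aligned_bases
    return coverage > min_coverage and identity > min_identity


def filter_alignments(alignments: List[Alignment]) -> List[Alignment]:
    """Keep only alignments that pass both thresholds."""
    return [aln for aln in alignments if passes_cutoff(aln)]


candidates = [
    Alignment("estA", 500, 450, 430),  # coverage 0.90, identity about 0.96
    Alignment("estB", 500, 390, 380),  # coverage 0.78, too short
    Alignment("estC", 500, 450, 400),  # identity about 0.89, too divergent
]
print([aln.est_id for aln in filter_alignments(candidates)])
```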


Problem of repeats and similarity? Or misassembly?


Similarity cutoff needed


3. Protein-based Gene Models

• GeneWise / FGENESH+

• KRIBB: GeneWise after prefiltering Proteins by BLASTx

– BLASTx Cutoff: Coverage > 80%, Identity > 80%
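The BLASTx prefilter above could be sketched as a pass over tab-separated hit records. Everything about the input is an assumption: the seven-column layout (query ID, protein ID, percent identity, alignment length, protein start, protein end, protein length) is chosen for illustration, coverage is taken to mean the fraction of the protein spanned by the hit, and `filter_protein_hits` is a hypothetical name.

```python
from typing import Iterable, List


def filter_protein_hits(lines: Iterable[str], min_coverage: float = 80.0,
                        min_identity: float = 80.0) -> List[str]:
    """Return protein IDs whose hit passes both cutoffs (percent values).

    Assumed columns: qseqid, sseqid, pident, length, sstart, send, slen.
    Each line is one hit; multi-segment hits are not combined.
    """
    kept: List[str] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        _query, protein, pident, _length, sstart, send, slen = line.split("\t")
        spanned = abs(int(send) - int(sstart)) + 1
        coverage = 100.0 * spanned / int(slen)
        if coverage > min_coverage and float(pident) > min_identity:
            if protein not in kept:
                kept.append(protein)
    return kept


records = [
    "BAC1\tP1\t91.5\t300\t1\t300\t320",   # coverage 93.75%, identity 91.5%
    "BAC1\tP2\t85.0\t100\t50\t149\t400",  # coverage 25%
    "BAC1\tP3\t75.0\t310\t1\t310\t320",   # identity 75%
]
print(filter_protein_hits(records))
```

Only the proteins that survive this filter would then be passed to the slower spliced aligner, which is the point of prefiltering.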


Sample Annotations: define gene structure and annotation


1) Full-length Evidenced Gene: C02HBa0025N15.220

• mRNA/Protein evidence

• Annotation

– Product: SNF1 [Lycopersicon esculentum]

– IPR000719 Prot_kinase

– GO:0006468(P) protein amino acid phosphorylation

– GO:0004672(F) protein kinase activity

– EC:2.7.1.-: Snf1-related protein kinase (KIN10) (SKIN10)

– TMHMM: outside


2) Full-length Evidenced Gene: C02HBa0066C13.60

• Protein evidence

• Annotation

– Product: phytochrome E [Lycopersicon esculentum]

– IPR001294 Phytochrome

– GO:0006355(P) regulation of transcription, DNA-dependent

– GO:0008020(F) G-protein coupled photoreceptor activity

– TargetP/TMHMM: C/outside

– FunCat: 30.01 intracellular signalling; 70.01 cell wall


3) Full-length Clue Evidenced Gene: C02HBa0060J03.170

• Kazusa full-length cDNA/EST evidence

• Annotation

– Product: putative protein [Arabidopsis thaliana]

– IPR001251: CRAL_bd_TRIO_C

– TMHMM: outside

[Figure: gene model; labels: ~1 kb, 3 exons]


4) Partially Evidenced Gene: C02HBa0060J03.90

• EST evidence

• Annotation

– Product: putative protein [Arabidopsis thaliana]

– IPR000719 Prot_kinase

– GO:0006468(P) protein amino acid phosphorylation

– GO:0004672(F) protein kinase activity

– GO:0016020(C) membrane

– TMHMM: outside


5) Gene with alternative splicing: C02HBa0060J03.40-4

• EST evidence

• Annotation

– Product: transformer-SR ribonucleoprotein [N. tabacum]

– IPR000504 RNA-binding region RNP-1

– GO:0003676(F) nucleic acid binding

– GO:0030529(C) ribonucleoprotein complex

– TargetP/TMHMM: C/outside


Annotation Results

| Property | Value | Unit |
|---|---|---|
| BAC (Annotated/Sequenced) | 22 / 29 | BAC |
| Length (Average/Total) | 122 / 2698 | kb |
| Putative Protein CDSs | 620 | gene |
| Gene Density | 4.6 | kb/gene |
| Gene Length, Average | 3.1 | kb |
| Exon Length, Average | 272 | bp |
| Exons per Gene, Average | 7.3 | exon/gene |
| With ESTs | 352 (57%) | gene |
| Protein Annotated | 446 (72%) | gene |
| Domain Annotated | 424 (68%) | gene |
| GO Annotated | 338 (55%) | gene |
| Pathway Annotated | 25 (4%) | gene |
| EC Annotated | 29 (5%) | gene |
| tRNA | 13 | gene |
| Repeats | 144 (5.3%) | kb |

*1. All values from annotated 22 BACs.
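The percentages in the results can be reproduced from the reported counts. The short check below assumes that the gene-level percentages are taken over the 620 putative protein CDSs and that the repeat percentage is taken over the 2698 kb total length; the slides do not state the denominators, but the rounded values agree under that reading.

```python
# Counts as reported on the Annotation Results slide.
total_genes = 620
total_length_kb = 2698
repeat_kb = 144

annotated_counts = {
    "With ESTs": 352,
    "Protein Annotated": 446,
    "Domain Annotated": 424,
    "GO Annotated": 338,
    "Pathway Annotated": 25,
    "EC Annotated": 29,
}

# Percentage of genes in each category, rounded to whole percent.
percentages = {
    name: round(100 * count / total_genes)
    for name, count in annotated_counts.items()
}

# Repeat content as a percentage of total sequence length, one decimal place.
repeat_percent = round(100 * repeat_kb / total_length_kb, 1)

for name, pct in percentages.items():
    print(f"{name}: {annotated_counts[name]} ({pct}%)")
print(f"Repeats: {repeat_kb} kb ({repeat_percent}%)")
```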


Future Works

• Training data set for Tomato gene HMM models

• Automation

• Performance assessment

• Manual curation (Apollo)


Acknowledgement

| Tool | Author | Source |
|---|---|---|
| BLAT | Jim Kent | UCSC (http://www.cse.ucsc.edu/~kent/) |
| FGENESH | Solovyev, et al. | SoftBerry, Inc. (http://www.softberry.com/) |
| GMAP | Thomas D. Wu, Colin K. Watanabe | Genentech, Inc. (http://www.gene.com/share/gmap/) |
| GeneSeqer | V. Brendel, et al. | Iowa State/Stanford University (http://bioinformatics.iastate.edu/bioinformatics2go/gs/download.html) |
| GeneWise | Ewan Birney | EBI (http://www.ebi.ac.uk/Wise2/) |
| InterProScan | | EBI (http://www.ebi.ac.uk/InterProScan/) |
| Miropeats | J.D. Parsons | Washington University (http://genomeold.wustl.edu/groups/informatics/software/miropeats/) |
| BLAST (NCBI) | S.F. Altschul | NCBI (http://www.ncbi.nlm.nih.gov/blast/) |
| Phred/phrap/cross_match | Phil Green | University of Washington (http://www.phrap.org/) |
| RepeatMasker | Arian Smit, P. Green | (http://www.repeatmasker.org/) |
| SIM4 | Liliana Florea, et al. | PennState University (http://globin.cse.psu.edu/) |
| TargetP | Olof Emanuelsson, et al. | CBS in Technical University of Denmark (http://www.cbs.dtu.dk/services/TargetP/) |
| TMHMM | A. Krogh, et al. | CBS in Technical University of Denmark (http://www.cbs.dtu.dk/services/TMHMM/) |
| tRNAscan-SE | T.M. Lowe, S.R. Eddy | Washington University (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/) |


http://sol.kribb.re.kr

Thank you for your attention!

