
Gene prediction roderic guigó i serra IMIM/UPF/CRG.

Date post: 17-Jan-2016
Page 1: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

gene prediction

roderic guigó i serra, IMIM/UPF/CRG

Page 2: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

number of genes in chromosome 22

• initial annotation: 545 (Dunham et al., 1999)

• genscan + RT-PCR: 590 (Das et al., 2001)

• genscan + microarrays: 730 (Shoemaker et al., 2001)

• reviewed annotation: 726 (chr22 team, Sanger, 2001)

• mouse shotgun data: +20 (our data)

• geneid predictions: 794

• genscan predictions: 1128

Page 3: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

number of genes in human genome

• Consortium: 30,000-40,000 (2001)

• Celera: 27,000-38,000 (2001)

• Consortium + Celera: 50,000 (Hogenesch et al., 2001)

• DB searches: 65,000-75,000 (Wright et al., 2001)

• Human Genome Sciences: 90,000-120,000 (Haseltine, 2001)

Page 4: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

decoding the genome

ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCG... [the slide is filled with several kilobases of raw genomic DNA sequence]

the human genome sequence

Page 5: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP

the amino acid sequence of the proteins

Page 6: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

Gene structure

[Figure labels: exons, introns, promoter, upstream regulatory element, downstream regulatory element]

Page 7: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

From DNA to RNA

Page 8: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

From RNA to protein

Page 9: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

Molecular mechanism

Page 10: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

Prediction of splice sites

Page 11: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

accuracy of gene prediction programs

Page 12: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

accuracy of gene prediction programs

Page 13: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

accuracy of gene prediction programs

Page 14: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

• rosetta (Batzoglou et al., 2000)

• cem (Bafna and Huson, 2000)

• sgp1 (Wiehe et al., 2000)

• twinscan (Korf et al., 2001)

• slam (Pachter et al., 2001)

• sgp2 (Guigó et al., in preparation)

comparative gene prediction

Page 15: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

Query sequence

tblastx HSPs

geneid exons

HSP projections

SGP exons

syntenic gene prediction (sgp2)
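
The tracks listed above suggest a two-step idea: tblastx HSPs on the query sequence are collapsed into a single projection, and geneid exons are then rescored with that similarity evidence to give the SGP exons. The sketch below illustrates only that idea. It assumes the projection keeps the best HSP score at each position and that an exon gains a bonus proportional to its mean projected score. The function names, the `weight` parameter, and the scoring formula are assumptions made for illustration and are not taken from the sgp2 program.

```python
from typing import List, Tuple

HSP = Tuple[int, int, float]   # (start, end, score); 0-based, end exclusive
Exon = Tuple[int, int, float]  # (start, end, geneid_score)


def project_hsps(hsps: List[HSP], length: int) -> List[float]:
    """Collapse overlapping HSPs into one per-position track (best score wins)."""
    track = [0.0] * length
    for start, end, score in hsps:
        for pos in range(max(start, 0), min(end, length)):
            track[pos] = max(track[pos], score)
    return track


def rescore_exons(exons: List[Exon], track: List[float],
                  weight: float = 0.5) -> List[Exon]:
    """Add a similarity bonus to each exon: weight * mean projected score.

    The weight and the use of the mean are hypothetical choices.
    """
    rescored = []
    for start, end, score in exons:
        covered = track[start:end]
        bonus = weight * sum(covered) / len(covered) if covered else 0.0
        rescored.append((start, end, score + bonus))
    return rescored


# Toy data: two overlapping HSPs and two candidate exons.
hsps = [(10, 20, 4.0), (15, 30, 6.0)]
exons = [(10, 20, 1.0), (40, 50, 1.0)]
track = project_hsps(hsps, 60)
sgp_exons = rescore_exons(exons, track)
print(sgp_exons)
```

In this toy run the first exon overlaps the projected HSPs and its score rises, while the second exon has no similarity support and keeps its original geneid score.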

Page 16: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

benchmarking sgp2 - accuracy

scimog

mit

Page 17: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

Predicting “novel” genes in the human genome

golden path annotations

additional blastn matches to ENSEMBL + REFSEQ

tblastx

geneid exons

tblastx

sgp genes

Golden Path Oct 7, 2000 freeze, RepeatMasked. TraceDB, as of February 2001

Page 18: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

“novel” genes ?

• 48,890 genic regions (known genes or similar)

• 15,489 genes longer than 100 aa predicted by sgp

• 13,302 non-redundant predictions

• 8,416 supported by tblastx hits to mouse 1.5

• 3,331 predicted genes with at least two exons supported by tblastx hits

• + 719 predicted genes supported by tblastx hits covering at least 75% of the prediction

= 4,050 supported sgp predictions (3,331 + 719)

25% of them not overlapping genscan predictions
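
The two support criteria above (at least two exons with tblastx hits, or hits covering at least 75% of the prediction) can be written as a small filter. This is a minimal sketch under stated assumptions: an exon counts as supported if any of its bases overlaps a hit, and coverage is measured over the predicted exon bases. Both conventions, and all names in the code, are illustrative assumptions and not the authors' implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end); 0-based, end exclusive


def covered_bases(exon: Interval, hits: List[Interval]) -> int:
    """Number of bases in the exon covered by at least one hit."""
    positions = set()
    for hit_start, hit_end in hits:
        lo, hi = max(exon[0], hit_start), min(exon[1], hit_end)
        positions.update(range(lo, hi))  # empty range when there is no overlap
    return len(positions)


def is_supported(exons: List[Interval], hits: List[Interval],
                 min_exons: int = 2, min_coverage: float = 0.75) -> bool:
    """True if enough exons have hits, or hits cover enough of the prediction."""
    per_exon = [covered_bases(exon, hits) for exon in exons]
    supported_exons = sum(1 for n in per_exon if n > 0)
    total = sum(end - start for start, end in exons)
    coverage = sum(per_exon) / total if total > 0 else 0.0
    return supported_exons >= min_exons or coverage >= min_coverage


# Two exons, each touched by a hit: passes the first criterion.
two_exon_support = is_supported([(0, 100), (200, 300)], [(50, 60), (250, 260)])
# Single exon, 80% covered: passes the second criterion.
coverage_support = is_supported([(0, 100)], [(0, 80)])
# One exon touched, 20% covered: fails both.
unsupported = is_supported([(0, 100), (200, 300)], [(0, 40)])
print(two_exon_support, coverage_support, unsupported)
```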

Page 19: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

validation of predictions

EST identity 18%

NR similarity 31%

CDD (NCBI) 24%

Mouse ESTs 28%

Rat ESTs 19%

Tetraodon 15%

at least one of the above 56%

Page 20: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

Experimental validation

Page 21: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

chr22

chr21

human genome vs. Mouse traceDB

Page 22: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

              SN    SP    CC    SNe   SPe   SNSP  ME    WE
chr22.assem.  0.87  0.65  0.75  0.69  0.54  0.62  0.14  0.33
chr22.shot.   0.82  0.66  0.72  0.63  0.54  0.58  0.20  0.31

human genome vs. Mouse assemblies
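
The slide does not define the column headings in the table above. The sketch below assumes the definitions commonly used when evaluating gene finders: SN and SP are nucleotide-level sensitivity and specificity (the latter being TP / (TP + FP)), CC is the correlation coefficient, SNe and SPe are their exact-match exon-level counterparts, SNSP is the mean of SNe and SPe, and ME and WE are the fractions of missed real exons and wrong predicted exons. The function names and toy coordinates are hypothetical, and the code assumes non-empty annotation and prediction sets.

```python
import math
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # (start, end); 0-based, end exclusive


def coding_positions(exons: List[Exon]) -> Set[int]:
    return {pos for start, end in exons for pos in range(start, end)}


def nucleotide_accuracy(real: List[Exon], pred: List[Exon], length: int):
    """Return (SN, SP, CC) computed over individual bases."""
    r, p = coding_positions(real), coding_positions(pred)
    tp, fp, fn = len(r & p), len(p - r), len(r - p)
    tn = length - tp - fp - fn
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc


def overlaps(a: Exon, b: Exon) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def exon_accuracy(real: List[Exon], pred: List[Exon]):
    """Return (SNe, SPe, SNSP, ME, WE); only exact boundary matches count."""
    r, p = set(real), set(pred)
    exact = len(r & p)
    sne, spe = exact / len(r), exact / len(p)
    me = sum(1 for e in r if not any(overlaps(e, q) for q in p)) / len(r)
    we = sum(1 for q in p if not any(overlaps(q, e) for e in r)) / len(p)
    return sne, spe, (sne + spe) / 2, me, we


# Toy annotation and prediction on a 1000-base sequence.
real = [(0, 100), (200, 300), (400, 500)]
pred = [(0, 100), (210, 300), (600, 650)]
sn, sp, cc = nucleotide_accuracy(real, pred, 1000)
sne, spe, snsp, me, we = exon_accuracy(real, pred)
print(round(sn, 2), round(sp, 2), round(cc, 2))
print(round(sne, 2), round(spe, 2), round(snsp, 2), round(me, 2), round(we, 2))
```

Under these definitions the SNSP column is consistent with the table: for chr22.assem., (0.69 + 0.54) / 2 is about 0.62.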

Page 23: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

                 chr22   chr21
Predicted          776     420
known             -655    -326
low complexity     -25      -5
short              -26     -11
intronless         -19     -34
                    45      36

testing novel predictions experimentally

In total, 81 predictions. For 40 of them, adjacent exon pairs were selected for RT-PCR.

Page 24: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

                               N    Success rate
Positive controls
  refseq                       78   96%
  Known tissue specific genes  20   25%
  Low expressing genes         13   Not ready
  Twinscan with EST support         Not ready
Test sets
  Twinscan                          Not ready
  SGP                          40   28%

preliminary results

Page 25: Gene prediction roderic guigó i serra IMIM/UPF/CRG.

acknowledgments

IMIM-UPF-CRG, Barcelona

• Josep F. Abril

• Genís Parra

• Roderic Guigó

GlaxoSmithKline, King of Prussia

• Pankaj Agarwal

Max Planck Institute for Chemical Ecology, Jena

• Thomas Wiehe

Whitehead Institute/MIT Center for Genome Research, Cambridge

• Gwen Acton

• Dan Brown

• Kerstin

Mouse Sequence Consortium

