Post on 13-Jan-2016
www.PHYTOME.org
a plant comparative genomics resource
Todd Vision,
Jason Phillips, Dihui Lu, Stefanie Hartmann
Outline of today’s presentation
1. What kind of data is stored in Phytome - and how did we generate this data?
2. How can you search Phytome?
3. What kind of results will Phytome give you?
Phytome integrates
organismal phylogeny
gene family information: sequences, alignments, phylogenies
genetic and physical maps
Phytome: applications
Starting with a gene family
• resolve orthology/paralogy relationships
• identify coevolving families
Starting with a species
• explore lineage-specific diversification
• guide comparative mapping bench-work
Starting with a chromosome segment
• identify homologous segments
• predict unobserved gene content (candidate QTL)
overview of the pipeline
EST - expressed sequence tags
• are partial sequences of expressed genes
• are error-prone, contain sequence or frame shift errors
• are very useful for discovering new genes, provide data on gene expression, make up much of the sequence data
EST contig assemblies
• contigs: contiguous sequences assembled from multiple overlapping ESTs
• singletons: ESTs that don't match other ESTs in the dataset
sources
• TIGR, PlantGDB, NCBI, TAIR, Sputnik, Plant Genome Network
• for each species, we used the source with the largest number of ESTs
[Figure: DNA → pre-RNA → mRNA → protein, with mRNA → cDNA → cDNA clone]
data acquisition
data acquisition/organismal phylogenies
Glycine max
Phaseolus coccineus
Lotus corniculatus
Medicago truncatula
Cucumis sativus
Prunus persica
Populus tremula x tremuloides
Arabidopsis thaliana
Brassica napus
Gossypium hirsutum
Theobroma cacao
Citrus sinensis
Vitis vinifera
Lycopersicon esculentum
Solanum tuberosum
Capsicum annuum
Nicotiana benthamiana
Helianthus annuus
Zinnia elegans
Stevia rebaudiana
Lactuca sativa
Beta vulgaris
Mesembryanthemum crystallinum
Eschscholzia californica
Hordeum vulgare
Triticum aestivum
Secale cereale
Avena sativa
Zea mays
Sorghum bicolor
Oryza sativa
Allium cepa
Amborella trichopoda
Cryptomeria japonica
Pinus taeda
Cycas rumphii
Ceratopteris richardii
Marchantia polymorpha
Physcomitrella patens
Saccharum officinarum
Clade labels: angiosperms, eudicotyledons, core eudicots, rosids, asterids, Liliopsida, conifers, cycad, fern, moss, liverwort
protein sequence prediction
from EST contigs to peptide sequences: ESTwise
• translate cDNA sequence (ESTs) in all reading frames
• compare the translated DNA to a database of known proteins (Swiss-Prot, TrEMBL)
• use this information for gene prediction/translation
• correct frame shift errors based on the homology information
protein TVKKAHFEKWGNIVDVDYFQHFGNIVDINIVIDKETGKKRGFAFVEFDDYDPVDKVVLQKQHQLNGKMVDV
TVK++HF +WG + D DYF+ +G I I I+ D+ +GKKRGF FV FD +D VDK+V+QK H +NG +V
TVKRSHFxQWGTLTDCDYFEQYGKIEVIEIMTDRGSGKKRGF!FVTFDGHDSVDKIVIQKYHTVNGHNxEV
EST agaaactNctgacagtgttgctgaaggagaaagcgagaaagt2tgatggcgtggaagacatcagagcatgg
ctaggataaggctcagaataaagatattattcaggggaaggt ttctagaactaatttaaaactagaaNat
tgagcttgagagcgcttttagtaatagtacgtcactcgagct tactcctccgtgtctgacttgtccctat
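The first step listed above, translating a cDNA in all reading frames, can be sketched as follows. This is only an illustration of six-frame translation with the standard genetic code; it is not ESTwise, and it does none of the homology comparison or frame shift correction. The function names are made up for this sketch, and codons containing an ambiguous base are simply reported as X.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate from the first base; unknown codons (e.g. with N) become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to peptide strings."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, peptide in six_frame_translations("ATGGCCAAGTGA").items():
        print(label, peptide)
```

A homology-based tool then picks (and stitches together) the frames that best match a known protein, which is what allows frame shift errors to be corrected.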
protein family clustering (Tribe-MCL)
input:
• a set of proteins
• all-vs-all BLAST values
method:
• construct weighted graph
• convert into Markov matrix
• expansion
• inflation
• repeat expansion and inflation until the matrix doesn't change
output:
• clusters of related proteins: protein families
image taken from the MCL homepage: http://micans.org/mcl/
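The expansion/inflation loop can be sketched in a few lines of plain Python. This is a toy version under simplifying assumptions (dense matrices, self-loops added, a naive read-out of clusters from the converged matrix); it is not the MCL or Tribe-MCL implementation, and the function names and default parameters are chosen for this sketch only.

```python
def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic Markov matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(weights, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a weighted graph given as a symmetric similarity matrix."""
    n = len(weights)
    # Add self-loops, then convert the weighted graph into a Markov matrix.
    m = normalize_columns(
        [[float(weights[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        expanded = matmul(m, m)                        # expansion
        inflated = normalize_columns(                  # inflation
            [[v ** inflation for v in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:                               # matrix no longer changes
            break
    # Each row with non-zero entries lists the members of one cluster.
    clusters = []
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Two groups of proteins with no similarity between the groups.
    w = [[0, 1, 1, 0, 0],
         [1, 0, 1, 0, 0],
         [1, 1, 0, 0, 0],
         [0, 0, 0, 0, 1],
         [0, 0, 0, 1, 0]]
    print(mcl(w))
```

A higher inflation value sharpens the matrix faster and tends to give more, smaller clusters.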
multiple sequence alignment
tested program   quality   speed     algorithm
ClustalW         +         ++        progressive
Mafft i          ++        +         iterative
Mafft p          ++        +++       progressive
T-Coffee         +++       memory!   consistency-based/progressive
Dialign          +++       time!     consistency-based
progressive sequence alignment:
1. generate pairwise distances from pairwise alignments
2. use distances to construct a guide tree
3. start by aligning the most similar sequences
4. progressively add more sequences to the existing alignment
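Steps 1 to 3 can be illustrated with a small sketch that derives a joining order from pairwise distances. Two stand-ins are assumed here: the distance is 1 minus the similarity ratio from Python's difflib (real aligners score pairwise alignments with a substitution matrix), and the guide tree is built by average linkage. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distance(a, b):
    """Stand-in distance: 1 - difflib similarity ratio of two sequences."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_tree_merges(seqs):
    """Return the order in which groups of sequences would be joined.

    seqs maps a unique name to a sequence. Each merge is a pair of name
    tuples; the most similar sequences are joined first.
    """
    dist = {
        frozenset((a, b)): pairwise_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }

    def average_distance(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


if __name__ == "__main__":
    example = {"a": "MKTAYIAKQR", "b": "MKTAYIAKQR", "c": "GGGGGGGGGG"}
    for left, right in guide_tree_merges(example):
        print(left, "+", right)
```

A progressive aligner would then align the sequences (and later the sub-alignments) in exactly this order.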
multiple sequence alignment
1. identification of homologous proteins, clustering these into a Phytome family, generation of a multiple sequence alignment
2. identification of homologous sequence positions within the homologous proteins, i.e., of columns of amino acids that share a common ancestral amino acid
multiple sequence alignment
1. find columns that will be retained
• remove columns with low average pairwise scores
• remove columns with high percentage of gaps
2. find sequences that will be retained
• remove sequences with a high proportion of gaps within the retained columns
• remove misaligned sequences (i.e., with a low overall score)
3. final check
• are enough sequences left for a phylogeny?
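The gap-based parts of these three steps can be sketched as below. The score-based criteria (low average pairwise column scores, low overall sequence score) are left out because they need a substitution matrix. All thresholds and the function name are assumptions made for this sketch, not the values used for Phytome.

```python
def filter_alignment(alignment, max_col_gap=0.5, max_seq_gap=0.5, min_seqs=4):
    """Reduce an alignment given as {name: aligned sequence}.

    Returns the reduced alignment, or None if too few sequences remain.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    length = len(rows[0])

    # 1. keep columns whose gap fraction is not too high
    keep_cols = [
        i for i in range(length)
        if sum(row[i] == "-" for row in rows) / len(rows) <= max_col_gap
    ]
    if not keep_cols:
        return None
    reduced = {
        name: "".join(alignment[name][i] for i in keep_cols) for name in names
    }

    # 2. keep sequences that are not mostly gaps within the retained columns
    kept = {
        name: seq for name, seq in reduced.items()
        if seq.count("-") / len(keep_cols) <= max_seq_gap
    }

    # 3. final check: are enough sequences left for a phylogeny?
    return kept if len(kept) >= min_seqs else None


if __name__ == "__main__":
    aln = {"s1": "MKT-LV", "s2": "MKT-LV", "s3": "MKTALV", "s4": "M----V"}
    print(filter_alignment(aln, min_seqs=3))
```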
phylogenetic inference
generate distance matrix
generate unrooted neighbor-joining tree
midpoint-root the tree
do molecular clock test
TreePuzzle
PHYLIP
?
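The first two steps (distance matrix, unrooted neighbor-joining tree) can be sketched as follows. This is a textbook neighbor-joining loop with a simple p-distance, written for illustration; it is not the PHYLIP code, it uses no substitution model, and midpoint rooting and the clock test are not covered. Function names and the Newick formatting are choices made for this sketch.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues over columns without gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def neighbor_joining(names, dist):
    """Build an unrooted tree as a Newick string.

    dist maps frozenset({name1, name2}) to a distance.
    """
    nodes = list(names)
    d = dict(dist)

    def get(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of every node.
        r = {a: sum(get(a, b) for b in nodes if b != a) for a in nodes}
        # Pick the pair that minimises the Q criterion.
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * get(*p) - r[p[0]] - r[p[1]])
        dab = get(a, b)
        len_a = 0.5 * dab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        new = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"
        # Distances from the new internal node to all remaining nodes.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (get(a, c) + get(b, c) - dab)
        nodes = [c for c in nodes if c not in (a, b)] + [new]

    a, b = nodes
    return f"({a}:{get(a, b):.3f},{b}:0.000);"


if __name__ == "__main__":
    aln = {"s1": "MKTLVA", "s2": "MKTLVG", "s3": "MRSIVA", "s4": "MRSIAA"}
    dist = {frozenset((x, y)): p_distance(aln[x], aln[y])
            for x, y in combinations(aln, 2)}
    print(neighbor_joining(list(aln), dist))
```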
defining subfamilies
[Figure: example family tree with unigene leaves (e.g. ghir40678, taes49609, lsat28223) and numbered groupings]
webflow, overview
search pages
result pages
Lab meeting, Sept 13, 2004: Phytome demo
Dihui - BLAST search
• a friend of mine is working with a plant called Lophopyrum elongatum (it's a weed, and it's salt-tolerant, and that's all I know about it). She just cloned a cDNA and wants to find out more about it: what it does and which other genes in which other taxa it is related to.
• Though Lophopyrum is not among the species represented in Phytome, I offered to see if I can find out more about her gene.
• Best to use for this: the single BLAST search.
• Navigate to the single BLAST search and explain the page. Mention batch BLAST.
• paste the friend's sequence into the appropriate field:
MEYQGQQQHDQATTNRVDEYGNPVAGHGVGTGMGAHGGVGTGAAAGGHFQPTREEHKAGGILQRSGSSSSSSSSEDDGMGGRRKKGIKDKIKEKLPGGHGDQQQTAGTYGQQGHTGMAGTGGNYGQPGHTGMAGTDGTGEKKGIMDKIKEKLPGQH
• explain the results page
• view the best result: taes7111 from wheat
• go to the best scoring family: 1980
Stefanie - Unigene search
• http://www.ebi.ac.uk/interpro/IEntry?ac=IPR000167
• search Phytome for InterproEntry 000167
• look at the hvul1175 entry:
  o The family and subfamily ID
  o InterPro and Gene Ontology results, but only if the Unipeptide is an exemplar of its subfamily
  o The species name
  o A link to the primary source for this unigene sequence
  o A list of related unigenes (from all sources) that contain common GenBank accession numbers in their assembly
  o Predicted peptide sequence (available for download in FASTA format)
Jason - "restrict by species" search
• You can search for families that do or do not contain members from particular species. Navigate to the "restrict by species" search and explain the page.
• The relationships among the species are displayed as a phylogenetic tree (NCBI taxonomy information), and you can select families to include or exclude using radio buttons to the right of each species name.
• If the default "either" is selected, Phytome will return a family regardless of whether there are members from that species.
• I'm interested in monocot gene families (from Hordeum, barley, to Allium, onion): I want to exclude all other taxa and only use gene families with monocot members. NOTE: explain the difference between setting the monocots to "include" and leaving them at "either": because species with small numbers of Unipeptides will necessarily lack members in most families, selecting "include" will return NO families!
• 119273 families were retrieved. Their family ID is shown.
• click on family number 1980
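The include/exclude/either logic described in this part of the demo can be sketched as a simple set filter. This is an assumed model of the behaviour, not Phytome's actual query code; the function name, the data layout, and the example family contents are hypothetical.

```python
def restrict_by_species(families, choices):
    """Return IDs of families that satisfy the per-species choices.

    families: {family_id: set of species with members in that family}
    choices:  {species: "include" | "exclude" | "either"}; species that
              are not listed are treated as "either".
    """
    required = {s for s, c in choices.items() if c == "include"}
    forbidden = {s for s, c in choices.items() if c == "exclude"}
    return [
        family_id
        for family_id, species in families.items()
        if required <= species and not (forbidden & species)
    ]


if __name__ == "__main__":
    # Hypothetical family contents, for illustration only.
    families = {
        1980: {"Hordeum vulgare", "Oryza sativa"},
        7: {"Oryza sativa", "Arabidopsis thaliana"},
        12: {"Arabidopsis thaliana"},
    }
    # Exclude a non-monocot, leave the monocots at "either".
    print(restrict_by_species(families, {"Arabidopsis thaliana": "exclude"}))
    # Setting every monocot to "include" demands members from all of them.
    print(restrict_by_species(
        families, {"Hordeum vulgare": "include", "Allium cepa": "include"}))
```

The second call shows why "include" is so restrictive: a single species without a member in the family is enough to drop that family.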
Stefanie - family results page
• The "Family Information Page" includes
  o Related families if this family is part of a superfamily (?)
  o Hyperlinks to subfamilies (these will work if the "Subfamily" tab is selected).
  o A link to a list of family members excluded from the reduced alignment by REAP
  o A list of those species represented within the family (these will work with the default species tab)
• The tabs below allow one to view
  o A list of member Unipeptides, which can be sorted either by subfamily or by species, depending on which tab is selected. From these lists, you may select members to include in a multiple alignment and/or phylogeny.
  o InterPro and GO assignments for an exemplar of each subfamily.
  o By selecting multiple Unipeptides and proceeding to the "Alignment Page", one can download a single file containing all the predicted peptide sequences (in FASTA format) as well as additional information such as the names used by the Unigene sources and the component GenBank accession numbers.
protein family clustering (Tribe-MCL)
I =  5    3.6  2.8  2.0  1.2
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     1    1    1    1    1
     2    2    1    1    1
     2    2    1    1    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     3    3    2    2    1
     4    4    3    1    1
     4    4    3    1    1
     4    4    3    1    1
     4    4    2    1    1
     5    5    4    1    1
     5    5    4    1    1
     6    5    4    3    1
     6    5    4    3    1
almost 1 million EST contigs/singletons
ESTwise translation
730,000 unigenes
BLAST all vs. BLAST all
640,000 unigenes to be clustered into families
110,000 singletons
...some numbers
data acquisition
species                                  tax_id   common name                      NCBI PGDB PGN SPNK TIGR
Allium cepa                              4679     onion                            X
Amborella trichopoda                     13333    amborella                        X
Arabidopsis thaliana                     3702     thale cress                      X
Avena sativa                             4498     oat                              X
Beta vulgaris                            161934   sugarbeet                        X
Brassica napus                           3708     rape                             X
Capsicum annuum                          4072     (ornamental) pepper              X
Ceratopteris richardii                   49495    water sprite or indian fern      X
Citrus sinensis                          2711     orange                           X
Cryptomeria japonica                     3369     Japanese cedar                   X
Cucumis sativus                          3659     cucumber                         X
Cycas rumphii                            58031    sago palm or seashore cycad      X
Eschscholzia californica                 3467     california poppy                 X
Glycine max                              3847     soybean                          X
Gossypium hirsutum                       3635     cotton (tetraploid)              X
Helianthus annuus                        4232     sunflower                        X
Hordeum vulgare                          4513     barley                           X
Lactuca sativa                           4236     lettuce                          X
Lotus corniculatus                       47247    lotus                            X
Lycopersicon esculentum                  4081     tomato                           X
Marchantia polymorpha                    3197     marchantia                       X
Medicago truncatula                      3880     barrel medic                     X
Mesembryanthemum crystallinum            3544     ice plant                        X
Nicotiana benthamiana                    4100     wild tobacco                     X
Oryza sativa                             4530     rice                             X
Physcomitrella patens                    3218     Physcomitrella moss              X
Pinus taeda                              3352     loblolly pine                    X
Phaseolus coccineus                      3886     scarlet runner bean              X
Populus tremula x Populus tremuloides    47664    aspen                            X
Prunus persica                           3760     peach                            X
Saccharum officinarum                    4547     plume grass or sugar cane        X
Secale cereale                           4550     rye                              X
Solanum tuberosum                        4113     potato                           X
Sorghum bicolor                          4558     sorghum                          X
Stevia rebaudiana                        55670    candyleaf                        X
Theobroma cacao                          3641     cacao                            X
Triticum aestivum                        4565     wheat                            X
Vitis vinifera                           29760    wine grape                       X
Zea mays                                 4577     corn                             X
Zinnia elegans                           34245    zinnia                           X
multiple sequence alignment
tested program   quality   speed     algorithm
ClustalW         +         ++        progressive
Mafft i          ++        +         iterative
Mafft p          ++        +++       progressive
T-Coffee         +++       memory!   consistency-based/progressive
Dialign          +++       time!     consistency-based

family   ClustalW   Mafft i   Mafft p2   Mafft p3   T-Coffee   Dialign
1        2061       12380     93         312        –          –
2        360        845       32         73         –          8429
3        5108       8414      182        467        –          –
4        950        2470      45         101        –          12533
5        307        404       22         59         –          3564
6        87         125       9          31         –          1376
7        104        128       9          24         –          1075
8        105        114       8          20         19207      887
9        46         33        6          16         11820      394
10       145        296       17         36         7736       898
11       4          5         1          3          177        27