mosaic: exploring reticulate protein family evolution
UQ, COMBIOAU, Brisbane, 02-03-09, Maetschke/Kassahn
motivation
the problem:
evolution is complex (horizontal gene transfer, hybridization, genetic recombination, ...)
describing reticulate (non-tree-like) phylogenetic relationships as trees may be an oversimplification
traditional methods:
phylogenetic tree inference gets increasingly complex and is not suitable
phylogenetic networks are even more complex and visualization is difficult
mosaic:
fast method to analyze and visualize (phylogenetic) sequence relationships
applied to identify and study non-tree-like protein families
aim: perform whole-proteome scans for reticulate proteins
n-grams & dot plots
phylogenetics: alignment is the cornerstone
classical alignment may fail for reticulate proteins
"alignment-free" methods: split the sequence into overlapping subsequences of length n
sequence: MSKRRMSVGQQTW...
4-grams: MSKR, SKRR, KRRM, RRMS, ...
[figure: n-gram dot plot of sequences S1 and S2; segment labels A B and B A]
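The n-gram split and the dot plot described on this slide can be sketched as follows. This is a minimal illustration, not the mosaic implementation: the function names `ngrams` and `dot_plot` are hypothetical, and the dot plot is assumed to mark every pair of positions whose n-grams are identical.

```python
import numpy as np

def ngrams(seq, n=4):
    """Return the overlapping n-grams of seq, in order of occurrence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def dot_plot(seq1, seq2, n=4):
    """Boolean matrix: entry (i, j) is True when the n-gram starting at
    position i of seq1 equals the n-gram starting at position j of seq2.
    Assumes both sequences have at least n residues."""
    g1 = np.array(ngrams(seq1, n))
    g2 = np.array(ngrams(seq2, n))
    return g1[:, None] == g2[None, :]

# First 4-grams of the sequence shown on the slide.
print(ngrams("MSKRRMSVGQQTW")[:4])

# Two made-up sequences with the same two segments in swapped order:
# the matches show up as two short diagonal runs away from the main diagonal.
dots = dot_plot("MSKRRMQQVTQ", "QQVTQMSKRRM")
print(np.argwhere(dots))
```

A classical global alignment would have to choose one of the two segments; the dot plot keeps both diagonal runs visible.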
some real n-gram dot plots
4-grams are "unique" for a sequence (we talk about '4' later...)
[figure: n-gram dot plot, n=4, c=10]
>AR_Pt MEVQLGLGRVYPRPPSKTYRGAFQNLFQSVREVIQNPGPRHPEAASAAPPGASLLLQQQQQQQQQQQQQQQQQQQQQQETSPRQQQQQGEDGSPQAHRRGPTGYLVLDEEQQPSQPQSAPECHPERGCVPEPGAAVAASKGLPQQLPAPPDEDDSAAPSTLSLLGPTFPGLSSCSADLKDILSEASTMQLLQQQQQEAVSEGSSSGRAREASGAPTSSKDNYLGGTSTISDSAKELCKAV...
[figure panels: n=4, c=10; n=1, c=2]
another n-gram dot plot: nuclear receptors
DBD: DNA binding domain, two zinc finger motifs
LBD: ligand binding domain
AF-1/AF-2: transcriptional activation domains
[figure: n-gram dot plot with regions labeled DBD and LBD]
n-gram sequence similarity s
given two sequences and their n-gram sets $S_1$ and $S_2$:
$$ s = \frac{|S_1 \cap S_2|}{\min(|S_1|, |S_2|)}, \qquad s \in [0 \ldots 1] $$
$|S_1 \cap S_2|$: number of shared n-grams
S = set of n-grams, e.g. {AAGR, AGRK, GRKQ, ...}
normalization: min: local alignment, max: global alignment
example: {AAG, AGQ, GQQ} ∩ {GQQ, QQQ} = {GQQ}
$$ s = \frac{1}{\min(3, 2)} = 0.5 $$
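The similarity on this slide, the number of shared n-grams divided by the size of the smaller (or larger) n-gram set, can be written in a few lines. The names `ngram_set` and `ngram_similarity` and the `norm` parameter are illustrative assumptions; passing `norm=max` gives the "global" variant mentioned on the slide.

```python
def ngram_set(seq, n=4):
    """Set of distinct overlapping n-grams of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def ngram_similarity(seq1, seq2, n=4, norm=min):
    """Shared n-grams divided by norm(|S1|, |S2|).

    norm=min corresponds to the 'local' variant, norm=max to the 'global' one.
    Returns 0.0 when either sequence is shorter than n.
    """
    s1, s2 = ngram_set(seq1, n), ngram_set(seq2, n)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / norm(len(s1), len(s2))

# The slide's 3-gram example: {AAG, AGQ, GQQ} and {GQQ, QQQ} share only GQQ.
print(ngram_similarity("AAGQQ", "GQQQ", n=3))  # 0.5
```

Because the computation uses set intersection, the cost grows with the size of the n-gram sets rather than with the product of the sequence lengths.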
n-gram similarity
fast: linear w.r.t. size of the n-gram sets (classical alignment is quadratic w.r.t. sequence length)
easy to interpret (0.5 = half of the n-grams of the smaller set are shared)
no parameters (gap penalty, gap extension penalty, ...)
can deal with shuffling of conserved segments and other "strange" cases (are they actually strange?)
better or worse than BLAST/FASTA? Who knows? (Hoehl 2008: alignment-free can be as good as classical alignment for inference of phylogeny; Edgar 2004: MUSCLE, an n-gram based alignment method)
why 4 and not 42
Hoehl 2008: n = 3...5
correlation between n-gram sequence similarity and species divergence times
standard deviation of sequence similarities
maximum AUC when distinguishing related and randomly shuffled sequences
[figure: MR, r = 0.93; 4 marked on the plots]
phylogenetic networks
different node and edge types
identification of reticulate events (e.g. recombination) is error-prone
computationally expensive
larger networks become messy
T-Rex: Makarenkov et al. 2001
NeighborNet/SplitsTree: Bryant et al. 2004, Huson et al. 1998
Newick: Cardona et al. 2008
larger networks - example
Huson et al. 2005 Bryant et al. 2004
graph = ridiculugram
layout dependent
distorted distances
random initialization
local minima
slow
[figure: spring layout of nuclear receptors; labels GR, MR, PR, AR]
mosaic plot
point size is similarity
no distortions
no random initialization
preserves full information
automatic clustering (spectral rearrangement)
no hard decision about number of clusters
spectral clustering
$s_{ij}$: n-gram similarity between sequences i and j
$\sigma$: defines neighborhood radius
Affinity matrix: $A = (a_{ij}), \quad a_{ij} = e^{-(1 - s_{ij})^2 / 2\sigma^2}$
"Degree" matrix: $D = (d_{ii}), \quad d_{ii} = \sum_k a_{ik}$
Laplacian matrix: $L = D - A$
eigenvector decomposition: $(e, v) = \mathrm{eig}(L)$, e: eigenvalues, v: eigenvectors
$v_2$: eigenvector for 2nd smallest eigenvalue (Fiedler vector); indicates clusters and how well they are separated
in Python/NumPy (S: matrix of similarities s_ij; sig: kernel width, corresponding to 2σ² in the affinity formula):
from numpy import exp, diag
from numpy.linalg import eigh
A = exp(-(1-S)**2/sig)   # affinity matrix
D = diag(A.sum(axis=0))  # degree matrix
L = D-A                  # Laplacian matrix
e,v = eigh(L)            # eigenvalues (ascending) and eigenvectors
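To show how the Fiedler vector can drive a rearrangement of the similarity matrix, here is a minimal sketch. It assumes the affinity is exp(-(1 - s)^2 / (2σ^2)) and that items are simply sorted by their Fiedler vector entry; the function name `fiedler_order`, the default `sigma` and the toy matrix are illustrative assumptions, not taken from mosaic.

```python
import numpy as np

def fiedler_order(S, sigma=0.5):
    """Return (order, fiedler) for a symmetric similarity matrix S.

    order   : indices sorted by their Fiedler vector entry
    fiedler : eigenvector of the 2nd smallest eigenvalue of L = D - A
    """
    S = np.asarray(S, dtype=float)
    A = np.exp(-(1.0 - S) ** 2 / (2.0 * sigma ** 2))  # affinity matrix
    D = np.diag(A.sum(axis=0))                        # degree matrix
    L = D - A                                         # graph Laplacian
    e, v = np.linalg.eigh(L)                          # eigenvalues in ascending order
    fiedler = v[:, 1]
    return np.argsort(fiedler), fiedler

# Toy similarities: items 0 and 2 are alike, items 1 and 3 are alike.
S = np.array([[1.0, 0.1, 0.9, 0.1],
              [0.1, 1.0, 0.1, 0.9],
              [0.9, 0.1, 1.0, 0.1],
              [0.1, 0.9, 0.1, 1.0]])
order, fiedler = fiedler_order(S)
print(order)                      # similar items end up next to each other
print(S[np.ix_(order, order)])    # rearranged matrix with two diagonal blocks
```

The sign of an eigenvector is arbitrary, so the two groups may come out in either order; only their separation is meaningful.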
spectral rearrangement
recursive spectral rearrangement
spectral clustering
takes "global" properties into account
fast and scales well
no random initialization => single run
global minimum => single, unique solution
few parameters: L, σ (σ <= mean of distance matrix)
"better" than k-means (works for non-spherical clusters) or single linkage hierarchical clustering (no chaining problem)
clustering is NP-hard and spectral clustering is "just another approximation"
recursive spectral clustering to improve cluster quality
mosaic - demo
the end
fast technique to visualize/analyze reticulate protein family evolution
matrix representation
spectral clustering
n-gram similarity
many other applications
Perl, free!
questions?
SCOP: five families randomly selected
Nuclear receptors
[figure panels: Ligand binding domain, N-terminal section, Zinc-finger domain]
mosaic - examples
Full length sequence:
[figure labels: GR, MR, PR, AR]
MrBayes v3.1.2, 10^6 generations, 4 chains, 240 CPU-hrs
Zinc finger domain
[figure labels: AR, GR, MR, PR]
MrBayes v3.1.2, 10^6 generations, 4 chains, 9 CPU-hrs
Ligand-binding domain
[figure labels: PR, AR, MR, GR]
MrBayes v3.1.2, 10^6 generations, 4 chains, 27 CPU-hrs
Upstream region
[figure: ?]
MrBayes v3.1.2, 10^6 generations, 4 chains, 87 CPU-hrs
quality q
given two sequences and their n-gram dot plot:
$$ q = \frac{\max(\mathrm{diag}(S_1, S_2))}{\min(n_1, n_2)}, \qquad q \in [0 \ldots 1] $$
diag = set of dot sums along diagonals
n = length of sequence
normalization: min: local alignment, max: global alignment
example:
$$ q = \frac{\max\{0, 1, 2, 4\}}{\min(6, 8)} = \frac{4}{6} \approx 0.67 $$
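The quality q can be sketched by summing the dots along every diagonal of the dot plot and dividing the largest sum by the shorter (or longer) sequence length. This follows one reading of the slide: the function name `quality`, the `norm` parameter and the two example sequences are made up for illustration.

```python
import numpy as np

def quality(seq1, seq2, n=4, norm=min):
    """Largest dot sum along any diagonal of the n-gram dot plot,
    divided by norm(len(seq1), len(seq2)).

    Returns 0.0 when either sequence is shorter than n.
    """
    if len(seq1) < n or len(seq2) < n:
        return 0.0
    g1 = np.array([seq1[i:i + n] for i in range(len(seq1) - n + 1)])
    g2 = np.array([seq2[i:i + n] for i in range(len(seq2) - n + 1)])
    dots = g1[:, None] == g2[None, :]
    rows, cols = dots.shape
    # One sum per diagonal, from the bottom-left corner to the top-right corner.
    diag_sums = [int(np.diagonal(dots, offset=k).sum())
                 for k in range(-rows + 1, cols)]
    return max(diag_sums) / norm(len(seq1), len(seq2))

# Hypothetical sequences of length 6 and 8 whose best diagonal has 4 dots,
# mirroring the numbers in the slide's example (4 / 6).
print(quality("MSKRRM", "MSKRAAAA", n=1))
```

Under this reading, with n > 1 a sequence of length m has only m - n + 1 n-grams, so even two identical sequences score slightly below 1.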
q over s
q-spectrum
n-gram dot plots