BioInformatics Consultation Practice 8 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account:...

BioInformatics Consultation
Practice 8

Gábor Pauler, Ph.D.

Tax.reg.no: 63673852-3-22
Bank account: 50400113-11065546

Location: 1st Széchenyi str., 7666 Pogány, Hungary
Tel: +36-309-015-488

E-mail: [email protected]

Content of the Practice

- Multiple sequence alignment
  - Basic terms
    - Searching Conserved Regions/Domains/Motifs/Patterns
      - Purposes
      - Degrees of similarity
    - Searching rapidly changing regions
  - Clustering methods
    - Similarity metrics
      - Partial Proximity
      - Multivariate Homolog Proximity
      - Complex Proximity Metrics
    - Algorithm types
      - K-means
        - Evaluation
      - Hierarchic Distributive
      - Hierarchic Agglomerative
        - Steps
        - Evaluation
        - Resolving value concentration problem
  - Phylogenetic tree analysis
    - Basic terms
    - Obstacles
  - Software:
    - ClustalW2
      - Main screen
      - Outputs
- References

Multiple sequence alignment: Basic terms

- Multiple sequence alignment compares a large number of sequences in order to deal with the following:

Conserved Regions/Domains/Motifs/Patterns (Konzervált régiók/domének/motívumok):

- These are coding parts of structurally sensitive proteins that are under considerable evolutionary pressure: the vast majority of mutations destroy them, therefore they have to be well conserved in survivors. We can search for them for 4 possible purposes:

- In Chromosome Walking (Kromoszóma lépkedés) we search for matching fragment ends, to assemble the longest possible contig

- In Gene Search (Génkeresés) we look for Expressed Sequence Tags (EST)

- In recognized genes, we can search for parts coding enzymes/active regions

- To create a Phylogenetic Tree (Gén-családfa) based on the similarity of sequences, inferring the descent of relatively distant organisms from each other.

Grades of Similarity (Hasonlóság) are:

- Analogy (Analógia): sequences code the same function, but have different origins

- Homology (Homológia): sequences code the same function and have a common origin

- Paralogy (Paralógia): sequences code the same (or a slightly modified) function and have a common origin, but they separated in the common ancestor by Gene Duplication, so they are in different regions of the genome

- Orthology (Ortológia): sequences code the same function, have a common origin (they separated by speciation rather than duplication), and they are in the same place in the genome

Comparing rapidly changing regions: these are non-coding parts, or they code structurally less sensitive proteins (e.g. fibrin), so survivability is sustained even under rapid variation

- Therefore we use them to analyze the descent of closely related organisms

Multiple sequence alignment: Clustering methods: Similarity metrics 1

- Grouping sequences is based on the Clustering Methods (Klaszterezési Módszerek) of multivariate statistics, which create groups from observed objects.

Clustering methods are based on Similarity/Matching/Proximity Metrics (Hasonlósági metrikák): all of them try to put similar objects into one group and dissimilar objects into separate groups. But how can we measure similarity?

Univariate, Partial Proximity (Egy változós, Parciális hasonlóság): how can we compare a given position in 2 sequences?

- Nominal (Nominális): {0,1} discrete valued distance:

  - E.g. the nucleotides in a given position of a nucleotide sequence are identical or not

- Cardinal (Kardinális): [0,1] continuous interval distance:

  - E.g. for nucleotides, we can give a fractional distance if at least the pyrimidine/purine group of the nucleotides is identical

  - E.g. for amino acids, we can set up continuous distance scores from physical properties

Nominal distance matrix of nucleotides:

     A    G    C    T
A    0    1    1    1
G    1    0    1    1
C    1    1    0    1
T    1    1    1    0

Cardinal distance matrix of nucleotides:

     A    G    C    T
A    0.0  0.5  1.0  1.0
G    0.5  0.0  1.0  1.0
C    1.0  1.0  0.0  0.5
T    1.0  1.0  0.5  0.0

Amino Acids   Color Code   Properties
AVFPMI        Red          Small (small + hydrophobic (incl. aromatic - Y))
DE            Blue         Acidic
RK            Purple       Basic
STYHCN        Green        Hydroxyl + Amine + Basic - Q
Others        Gray
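The two nucleotide matrices above can be expressed as small functions. This is a minimal sketch in Python; the function names `nominal_distance` and `cardinal_distance` are hypothetical, chosen only for illustration.

```python
# Partial (single-position) distances between two nucleotides.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nominal_distance(a: str, b: str) -> float:
    """{0,1} distance: 0 if the nucleotides are identical, 1 otherwise."""
    return 0.0 if a == b else 1.0


def cardinal_distance(a: str, b: str) -> float:
    """[0,1] distance: 0.5 if both nucleotides fall in the same
    purine/pyrimidine group, 1.0 if they fall in different groups."""
    if a == b:
        return 0.0
    same_group = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return 0.5 if same_group else 1.0


# Print the cardinal matrix row by row, in the order A, G, C, T.
for x in "AGCT":
    print(x, [cardinal_distance(x, y) for y in "AGCT"])
```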

Multiple sequence alignment: Clustering methods: Similarity metrics 2

Multivariate Homolog Proximity Measures (Több változós hasonlóság homológ objektumok közt): homolog sequences have numerous positions (i=1..n). How do we aggregate the position-wise distances into a single distance metric?

- It is easier to imagine it graphically: the value of each position forms a coordinate axis in an n-dimensional coordinate system called the Decision Space (Döntési Tér). Each compared sequence appears there as 1 coordinate point. We try to measure the distance between 2 coordinate points with alternative methods:

We can summarize the mismatch of sequences i=1..n positions long with the Manhattan Distance (Manhattan távolság):

\[ \text{Mismatch \%} = \frac{\sum_{i=1}^{n} \text{Score}_i}{n} \tag{8.1} \]

Because of the simple summation, it is called a Fully Compensatory (Teljesen Kompenzáló) distance metric, but it can be misleading! If ET's genome also has many repetitive parts, even ET can be considered similar to humans because of numerous small random matches, which fully compensate for the essential, longer non-matching parts!

- Graphically, it is the distance between points when moving on a "grid"
- Equally distant points are on a square rotated by 45°
- The distance function from a point forms a pyramid rotated by 45°
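As a minimal sketch of formula (8.1), the mismatch of two equal-length sequences can be computed in Python as below. The function name `mismatch_manhattan` is hypothetical, and the nominal {0,1} score per position is assumed.

```python
def mismatch_manhattan(seq_a: str, seq_b: str) -> float:
    """Mean of the position-wise {0,1} mismatch scores, as in (8.1)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    scores = [0.0 if a == b else 1.0 for a, b in zip(seq_a, seq_b)]
    return sum(scores) / len(scores)


# One mismatching position out of three.
print(mismatch_manhattan("GTG", "GTA"))
```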

ATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGCTTATCGTACGTACTTGAACTGACGTAGCGTAGCGTAGCGTAGCGTAGCGTAGCGTAGCGTAGACACGATACGTACGGCCTGATAC

[Figure: decision space with axes Pos1, Pos2, Pos3 (values A, G, C, T, *) showing the example points TGC and GTG, and a 2-position decision space (Pos1, Pos2) showing the point CC]

Multiple sequence alignment: Clustering methods: Similarity metrics 3

- Alternatively, we can compute the Euclidean distance (Euklideszi távolság) for sequences n positions long:

\[ \text{Mismatch \%} = \left( \frac{\sum_{i=1}^{n} \text{Score}_i^{2}}{n} \right)^{0.5} \tag{8.2} \]

- Please note that it squares the error scores, so within the sum a large difference outweighs small ones even more. This way a lot of small random matches cannot compensate for a longer mismatching region, because the Euclidean Distance is a Not Fully Compensatory (Nem teljesen kompenzáló) distance metric

- Graphically, it is the "straight" distance between points

- Equally distant points are on a circle
- The distance function from a point forms a cone

Complex Proximity Metrics (Komplex hasonlósági metrikák): unfortunately, frameshift mutations can insert/delete nucleotides in originally homolog sequences, therefore these cannot be compared position by position:

- We use BLASTP-type word search algorithms to find the most compatible parts of 2 sequences, and the BLASTP matching score is used as the distance
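The difference between the two aggregations can be illustrated with a small Python sketch. The two score vectors below are hypothetical: one spreads four partial mismatches of 0.5 over the sequence, the other concentrates two full mismatches in one block.

```python
import math


def manhattan(scores):
    """Mean of the position-wise scores, as in (8.1)."""
    return sum(scores) / len(scores)


def euclidean(scores):
    """Root of the mean squared position-wise score, as in (8.2)."""
    return math.sqrt(sum(s * s for s in scores) / len(scores))


spread = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]  # many small mismatches
block = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # one compact mismatch

# The Manhattan form rates both cases the same, the Euclidean form
# rates the compact full mismatch as the larger distance.
print(manhattan(spread), manhattan(block))
print(euclidean(spread), euclidean(block))
```

Under these assumed scores the Manhattan value is identical for both vectors, while the Euclidean value is larger for the block of full mismatches.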
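The idea of a word-based comparison can be sketched in Python as follows. This is not BLASTP: it is a much simpler stand-in that only counts shared words of length k, and the names `kmers` and `word_match_distance` are hypothetical.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct words of length k occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def word_match_distance(seq_a: str, seq_b: str, k: int = 3) -> float:
    """1 minus the share of words common to both sequences
    (0 = every word of the shorter word set is shared)."""
    words_a, words_b = kmers(seq_a, k), kmers(seq_b, k)
    if not words_a or not words_b:
        return 1.0
    shared = len(words_a & words_b)
    return 1.0 - shared / min(len(words_a), len(words_b))


# The second sequence is the first one shifted by one position:
# a position-by-position comparison mismatches everywhere,
# the word-based comparison still recognises the similarity.
a, b = "ACGTACGT", "CGTACGTA"
print(sum(x != y for x, y in zip(a, b)) / len(a))
print(word_match_distance(a, b))
```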

[Figure: the example sequences and decision spaces of the previous slide repeated (axes Pos1, Pos2, Pos3 with values A, G, C, T, *; points CC, TGC and GTG)]









Multiple sequence alignment: Clustering methods: Algorithm types 1

K-means Clustering (K-közép klaszterezés)

- It can be used only if we know the number of groups to create (k)

- It starts with k random sequences of equal length (n) as group centroids. They are coordinates in the n-dimensional space

- During numerous iterations the group centroids repel each other in space (i.e. they try to be as different sequences as possible); however, they are attracted by large groups of coordinate points of the m observed sequences and move towards them in decreasing steps (i.e. they try to become the compromise sequence of a large group)

- Iteration stops when the aggregated movement of the group centroids no longer exceeds a threshold, reaching Lyapunov Stability (Ljapunov-stabilitás): they can oscillate back and forth, but basically will not move.

- Finally, the group centroids distribute the observed sequences: each sequence is assigned to the nearest group centroid, forming the groups

Evaluation of K-means Clustering:

- It is less sensitive to outlier sequences (e.g. results of sequencing errors or frameshift mutations)

- It has a low computational requirement, growing linearly with the number of observed sequences m and the number of groups k

- It can delimit only Compact Shaped Clusters (Kompakt csoportok) in the decision space, and it gets uncertain in case of strong mutation cross-effects across sequence positions, which result in elongated, so-called Spurious Clusters (Elnyújtott Alakú Klaszter) in the decision space (e.g. some of the sample sequences were formed from other samples by translocation mutation)

- All groups will be on the same level of grouping; they cannot be ordered in a hierarchy

- It can be used only if we know the number of groups in advance. In phylogenetic analysis, this is not the usual situation!
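A minimal Python sketch of K-means on sequences is given below. It assumes the widely used assign-and-recompute iteration rather than the repel/attract formulation described above, a one-hot encoding of each position, and a hypothetical movement threshold; all names are illustrative.

```python
import random

ALPHABET = "AGCT*"


def one_hot(seq):
    """Encode each position as 5 coordinates (one per symbol)."""
    return [1.0 if ch == sym else 0.0 for ch in seq for sym in ALPHABET]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100, threshold=1e-9, seed=0):
    rng = random.Random(seed)
    # Start from k randomly chosen observations as centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        moved = 0.0
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue  # keep an empty group's centroid where it is
            new = [sum(col) / len(members) for col in zip(*members)]
            moved += sq_dist(new, centroids[c])
            centroids[c] = new
        # Stop when the aggregated centroid movement is negligible.
        if moved <= threshold:
            break
    return labels, centroids


seqs = ["GTG", "G*G", "GTA", "TTT", "CCT", "CTT", "CGT"]
labels, _ = k_means([one_hot(s) for s in seqs], k=2)
print(dict(zip(seqs, labels)))
```

The result depends on the random starting centroids, which is why the sketch fixes a seed.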

[Figure: two panels of the decision space with axes Pos1, Pos2, Pos3 (values A, G, C, T, *) showing the sample sequences GTG, G*G, GTA, TTT, CCT, CTT and CGT]

Multiple sequence alignment: Clustering methods: Hierarchic: Agglomerative 1

Hierarchic Clustering (Hierarchikus klaszterezés): it creates a hierarchy of groups

- Distributive Method (Disztributiv módszer): at the beginning it treats the whole sample of sequences as one group, then it splits it into subgroups

- Agglomerative Method (Agglomeratív módszer): just the reverse, and more usable in practice:

  - STEP 0: at the start, the m observed sequences form m separate groups

  - STEP 1: it compares the still existing groups pairwise, and the nearest two are agglomerated into one group. There are 2 methods to compute the distance of groups:

    - Nearest Neighbor Method (Legközelebbi szomszéd): the distance of 2 groups already containing several sequences is defined by their closest members:

      - This detects spurious clusters well
      - But it is extremely sensitive to Outlier (Kilógó) sequences (e.g. results of sequencing errors): they form a separate group against the rest of the sample, which is misleading

    - Ward Method (Ward-módszer): the distance of 2 groups is given by the variance of their joint members in the decision space.

      - This can detect only more compact clusters than nearest neighbor

      - But it can still detect more spurious clusters than K-means

      - Less sensitive to outlier sequences
      - Higher computational requirement
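The two group distances can be sketched in Python as below. The sketch follows the wording of these notes, so the Ward-type distance is taken literally as the variance of the joint members; the textbook Ward criterion (increase of the within-group sum of squares) is not computed here. Function names and the sample points are hypothetical.

```python
import math


def nearest_neighbor_distance(group_a, group_b):
    """Distance of the two closest members of the groups."""
    return min(math.dist(a, b) for a in group_a for b in group_b)


def joint_variance(group_a, group_b):
    """Mean squared distance of all joint members from their centroid."""
    members = group_a + group_b
    centroid = [sum(col) / len(members) for col in zip(*members)]
    return sum(math.dist(m, centroid) ** 2 for m in members) / len(members)


a = [[0.0, 0.0], [1.0, 0.0]]
b = [[5.0, 0.0], [6.0, 0.0]]
print(nearest_neighbor_distance(a, b))
print(joint_variance(a, b))
```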

Multiple sequence alignment: Clustering methods: Hierarchic: Agglomerative 2

- STEP 2: it records how much distance between the 2 groups disappeared through their agglomeration, and plots it on the Scree Plot (Könyök diagramm): a line chart representing the information loss in a given iteration of the algorithm

- STEP 3: it records on a binary Dendrogram (Bináris fa diagramm) which two groups (Leaf elements, Levél elem) were aggregated into a new group (Branch element, Ág elem). The length of the branches expresses the distance that disappeared

- STEP 4: GOTO STEP 1 until the m initial groups are aggregated into 1 in m-1 iterations

- Termination: of course, agglomerating everything into 1 group does not make any sense. However, during the m-1 iterations we observe the scree plot:

  - If the information loss suddenly jumps up after a given iteration, it signals that the further iterations can be discarded. This is how the method detects the number of groups to keep.

  - If the scree plot has no break at all and resembles a mirrored 1/x function, there are no distinctive groups, or the clusters were too spurious to detect

  - Multiple big "steps" on the scree plot show a multi-level hierarchic group structure
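STEP 0 to STEP 4 can be sketched in Python as follows, assuming the Nearest Neighbor Method for group distance and the nominal {0,1} position score. The function names are hypothetical, and the recorded merge distances serve as the scree plot values.

```python
def hamming(a, b):
    """Number of mismatching positions of two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor(g1, g2):
    return min(hamming(a, b) for a in g1 for b in g2)


def agglomerate(seqs):
    groups = [[s] for s in seqs]              # STEP 0: m separate groups
    merges = []                               # dendrogram record
    while len(groups) > 1:                    # STEP 4: repeat m-1 times
        # STEP 1: find the nearest pair of still existing groups.
        d, i, j = min((nearest_neighbor(groups[i], groups[j]), i, j)
                      for i in range(len(groups))
                      for j in range(i + 1, len(groups)))
        # STEP 2 and 3: record the lost distance and the merged groups.
        merges.append((d, groups[i], groups[j]))
        merged = groups[i] + groups[j]
        groups = [g for n, g in enumerate(groups) if n not in (i, j)]
        groups.append(merged)
    return merges


merges = agglomerate(["GTG", "G*G", "GTA", "TTT", "CCT", "CTT", "CGT"])
scree = [d for d, _, _ in merges]
print(scree)
```

For these seven sample sequences the merge distance stays at 1 for five iterations and jumps to 2 in the last one, which suggests stopping at 2 groups.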

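[Figure: dendrogram of the sample sequences GTG, G*G, GTA, TTT, CCT, CTT, CGT with a Distance axis, and scree plot of Distance against Clusters left (7, 6, 5, 4, 3, 2, 1) with the stopping point marked "Stop!"]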

Multiple sequence alignment: Clustering methods: Agglomerative 3

Evaluation of agglomerative hierarchic clustering:

- It can create a binary group hierarchy. From the dendrogram we can read the Phylogenetic Tree (filogenikus fa)

- It detects the number of groups automatically

- It has exact termination criteria instead of the Lyapunov stability threshold of K-means clustering

- Unfortunately, it has a much higher computational requirement: it grows quadratically with the number of observations m. CURE: if we have to group a large number of sequences (e.g. 10000s), we pre-group them with K-means into 1000 groups, and from each group we select the sequence nearest to the group centroid. These selected sequences are weighted with the pre-group sizes and grouped further with the hierarchic agglomerative method

- It is more sensitive to outliers than K-means: in case of many outliers it typically creates very unevenly sized, very Unstable Clusters (Instabil csoport): adding/removing one sequence to/from the sample leads to totally different results. Therefore it is important to check the group sizes even if we are primarily interested in the dendrogram, because unbalanced group sizes warn of a distorted, unbalanced, useless dendrogram! CURE: outliers should be removed from the sample before clustering (ET, go home!)

- It does not really work well if there is a Value Concentration Problem (Érték-koncentrációs probléma): there is only a low number of possible values in one sequence position (e.g. A, C, T, G for nucleotides), so the partial distances are not continuous but discrete, even {0,1} binary valued. This results in a huge amount of identical distances among the observed sequences, and the agglomeration cannot decide which of them to agglomerate first. Therefore the grouping will again be very unstable. CURE: we can resolve this only with painfully computation-intensive auxiliary methods embedded in the agglomerative clustering algorithm.
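The value concentration problem can be made visible with a few lines of Python. This sketch assumes the nominal {0,1} position score and counts how often each pairwise distance occurs among seven short sample sequences.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions of two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


seqs = ["GTG", "G*G", "GTA", "TTT", "CCT", "CTT", "CGT"]
distances = [hamming(a, b) for a, b in combinations(seqs, 2)]
counts = Counter(distances)

# 21 pairs share only 3 distinct distance values, so most pairs are tied.
print(len(distances), dict(counts))
```

With so many ties, the order in which equally distant pairs are agglomerated is arbitrary, which is the source of the instability described above.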

[Figure: example groups CG, CCGC, GT, GG]









Multiple sequence alignment: Clustering methods: Hierarchic: Resolving value concentration 1

- Agglomeration optimization with dynamic programming:

  - The optimal pair to agglomerate among many identical distances could be determined by a Dynamic Integer Linear Programming Model, solved with the branch and bound (B&B) algorithm

  - This would give exactly the optimal grouping and hierarchy of sequences

  - It has an infeasibly colossal computational requirement

- Progressive Method:

  - At first, sequences are matched pairwise with a BLASTP-type word search algorithm

  - For the best matching pair an Alignment String (Illeszkedési Sztring) is determined, which represents the agglomerated group. In further iterations, the alignment strings are used to match groups pairwise.

  - By default the BLASTP algorithm has a higher computational requirement than distance computation, but a much lower one than B&B optimization

  - Moreover, as shorter alignment strings are matched instead of whole sequences, it reduces the computational requirement considerably

  - In a given position of the alignment strings there are many more possible values, because of partial matches, than in a position of a nucleotide sequence. This resolves the value concentration problem

  - It can also handle frameshift mutations

[Figure: example groups CG, CCGC, GT, GG with Group / NotGrp decision branches, and the resulting alignment string cCTt]

Multiple sequence alignment: Clustering methods: Hierarchic: Resolving value concentration 2

- Weighted Progressive Method:

  - Same as the Progressive Method, except that it computes a weight for each sample sequence: the average of its BLASTP match scores against all the other m-1 sequences

  - Higher weighted sample sequences/groups tend to be near the group centroids, so they are preferred at agglomeration

  - It has a very slightly higher computational requirement than the Progressive Method

  - But it is even more effective

- Iterative Method:

  - Same as the Weighted Progressive Method, except that it also allows comparing the alignment string of a sample sequence/group with already clustered sequences/subgroups

  - It has a considerably higher computational requirement than the progressive methods, growing with the third power of the number of sample sequences m

  - It can perform well if the dendrogram has many, almost equally good alternative solutions because of spurious clusters

- Motif Method:

  - Same as the Progressive Method, except that the search for matching Expressed Sequence Tags (EST) is executed before the agglomeration decision. ESTs are described by weight matrices specific to the organisms of origin of the sample sequences. ESTs are searched with a Hidden Markov Model (HMM) algorithm.
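The sequence weight of the Weighted Progressive Method can be sketched in Python as the average pairwise score of each sequence. The scores below are the pairwise scores of four sequences from the ClustalW2 score table in these notes, used here only as illustrative input in place of BLASTP match scores; the function name is hypothetical.

```python
def sequence_weights(pair_scores):
    """Average score of each sequence against all other sequences."""
    totals, counts = {}, {}
    for (a, b), score in pair_scores.items():
        for name in (a, b):
            totals[name] = totals.get(name, 0) + score
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


scores = {
    ("S.meliloti", "R.sp"): 83,
    ("S.meliloti", "P.stutzeri"): 72,
    ("S.meliloti", "A.ehrlichei"): 72,
    ("R.sp", "P.stutzeri"): 75,
    ("R.sp", "A.ehrlichei"): 71,
    ("P.stutzeri", "A.ehrlichei"): 76,
}

weights = sequence_weights(scores)
# The highest weighted sequence would be preferred at agglomeration.
print(max(weights, key=weights.get), weights)
```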

[Figure: alignment strings cCTt and *TT with +1 weights, and an EST weight matrix over the values A, G, C, T]

Multiple sequence alignment: Phylogenetic tree analysis

Basic terms:

- It is a tree structure whose leaves are the analysed sequences
- It is defined by a hierarchic clustering method using a given distance metric of sequences
- From the similarities/distances we try to infer from which Most Recent Common Ancestor (Legutobbi Közös Ős) sequence the other sequences branched off
- We use the Molecular Clock (Molekláris Óra) hypothesis, which assumes that forming a given quantity of genetic modification (mutations, recombinations) requires a given number of generations on average. This way, from the distances we can compute the Coalescence time (Szétválási Idő), so the tree can be represented on a time axis

Obstacles of Phylogenetic analysis:

- Deviant Sequence (Deviáns szekvencia): a sequence containing a disproportionately large number of Match Gaps (Illesztési rés) within the given group; it is best to omit it from the analysis

- Long Branch Attraction (Ál-távoli rokon effektus): organisms proven to be closely related sometimes show up much more distant in the tree, resulting in a long branch, because the following factors can have different speeds in the analysed sub-species (in standard usage the term also covers the related artefact where such long-branched, fast-evolving lineages are wrongly grouped together):

  - Evolution Rate (Evolúciós Ráta)% = mutation probability% × survival probability% (8.3)

  - Speed of DNA Repair (DNS javító) mechanisms

- Silent Mutation (Csendes mutáció): many mutations are in non-coding parts without evolutionary pressure

- Gene Convergence (Génkonvergencia): originally very distant organisms living under the same niche conditions tend to express very similar proteins. E.g. the hearing protein prestin shows convergent sequence changes in echolocating bats and toothed whales, although these groups are only distantly related

- Horizontal Gene Transfer (Horizontális géntranszfert): Retroviruses (Retrovírus) write their genome back into the cellular genome, creating gene transfer within one generation, without any descent, which creates pseudo-paralogy between very distant organisms
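Under the molecular clock hypothesis, a divergence time can be estimated from a pairwise distance. The Python sketch below assumes the common strict-clock form t = d / (2r), where the factor 2 accounts for both lineages accumulating changes; the formula, the function name and the rate value are assumptions for illustration and are not taken from these notes.

```python
def divergence_time(distance: float, rate: float) -> float:
    """Time since two sequences separated.

    distance: changes per site between the two sequences
    rate: changes per site per generation along one lineage
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate)


# Hypothetical values: 2% divergence, 1e-8 changes per site per generation.
print(divergence_time(0.02, 1e-8))
```

The result is expressed in the same time unit as the rate, here generations.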









Multiple sequence alignment: Software: ClustalW2: Main screen 1

http://www.ebi.ac.uk/Tools/clustalw2/index.html

At the Main Screen:

- Set of sequences: paste them one after another in FASTA format, and give your e-mail and the TITLE of the analysis

- Alignment: Full/Fast: you can select between full distance computation and word search

- KTUP word size: def/1..5: word size for word search; def everywhere means automatic optimization!

- Window length: def/0..10: max length of HSP windows

- Score type: absolute/percent: scoring matrix values are used as they are, or their sum is normalized to 100%

- Top diag: def/1..10: max number of insertion mutations between sequence pairs that can be handled at the same time

- Matrix: def/blosum30/pam350/gonnet250: type of score matrix

- Pairgap: def/1..500: penalty weight of a gap at the starting position of matching parts in a pair
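The sequence set pasted into the form has to be in FASTA format: a header line starting with ">" followed by the sequence lines. A minimal Python sketch for producing such text is shown below; the function name, the record names and the sequences are hypothetical.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text,
    wrapping sequence lines at the given width."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


print(to_fasta([("seq1", "ACGTACGTAC"), ("seq2", "ACGTTCGTAC")]))
```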


Multiple sequence alignment: Software: ClustalW2: Main screen 2

- GAP Open/Extend/Distance: penalty weight for opening a gap, extending a gap, or continuing a match at a large distance within HSPs

- No end gap: Yes/No: gaps are not allowed at the end of a sequence (this means equal length sequences)

- Iteration, Numiter: use match iteration, and at most how many levels to step back in the tree for comparison

- Run button: Execute

For proteins, we should consider:

- Matrix selection: a lower BLOSUM or a higher PAM matrix detects more Divergent (Szétszórt) matches over a large region; the opposite detects a Convergent (Összetartó) match in one block

- Synchronizing the matrix with the other parameters (see table)

For nucleotides, we can select:

- Nucleotide Similarity Matrix (Nukleotid hasonlósági mátrix): no partial match, or

- PUPPY: partial match inside the Purine bases (Purin bázis) (A, G) group and inside the Pyrimidine bases (Pirimidin bázis) (C, T, U) group

Protein length   Matrix     Gap open penalty   Gap extend penalty
>300             BLOSUM50   -10                -2
85-300           BLOSUM62   -7                 -1
50-85            BLOSUM80   -16                -4
>300             PAM250     -10                -2
85-300           PAM120     -16                -4
35-85            MDM40      -12                -2
<=35             MDM20      -22                -4
<=10             MDM10      -23                -4
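The table can be turned into a small lookup in Python. This sketch assumes that the upper bound of each length range is inclusive and the lower bound exclusive, which the table itself does not specify; the names `BLOSUM_ROWS`, `PAM_ROWS` and `suggest` are hypothetical.

```python
INF = float("inf")

# (lower bound exclusive, upper bound inclusive, matrix, gap open, gap extend)
BLOSUM_ROWS = [
    (300, INF, "BLOSUM50", -10, -2),
    (85, 300, "BLOSUM62", -7, -1),
    (50, 85, "BLOSUM80", -16, -4),
]
PAM_ROWS = [
    (300, INF, "PAM250", -10, -2),
    (85, 300, "PAM120", -16, -4),
    (35, 85, "MDM40", -12, -2),
    (10, 35, "MDM20", -22, -4),
    (0, 10, "MDM10", -23, -4),
]


def suggest(length, rows):
    """Return (matrix, gap open, gap extend) for a protein length."""
    for low, high, matrix, gap_open, gap_extend in rows:
        if low < length <= high:
            return matrix, gap_open, gap_extend
    raise ValueError("no table row covers this protein length")


print(suggest(386, BLOSUM_ROWS))
print(suggest(100, PAM_ROWS))
```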


Multiple sequence alignment: Software: ClustalW2: Outputs 1

Summary:

- Overview table
- Jalview graphic match browser:
  - Matches can be overridden manually
  - The Calculate|Tree menu shows the tree

Score table:

SeqA  Name         Len(aa)  SeqB  Name         Len(aa)  Score
1     S.meliloti   386      2     R.sp         386      83
1     S.meliloti   386      3     P.stutzeri   386      72
1     S.meliloti   386      4     A.ehrlichei  386      72
1     S.meliloti   386      5     P.aerugi     386      68
1     S.meliloti   386      6     M.algicola   388      67
2     R.sp         386      3     P.stutzeri   386      75
2     R.sp         386      4     A.ehrlichei  386      71
2     R.sp         386      5     P.aerugi     386      73
2     R.sp         386      6     M.algicola   388      70
3     P.stutzeri   386      4     A.ehrlichei  386      76
3     P.stutzeri   386      5     P.aerugi     386      77
3     P.stutzeri   386      6     M.algicola   388      74
4     A.ehrlichei  386      5     P.aerugi     386      70
4     A.ehrlichei  386      6     M.algicola   388      71
5     P.aerugi     386      6     M.algicola   388      71

Detailed matches:

- Amino acids are colored with standard colors
- The colors are based on their biochemical properties:

Amino Acids   Color Code   Properties
AVFPMILW      Red          Small (small + hydrophobic (incl. aromatic - Y))
DE            Blue         Acidic
RK            Purple       Basic
STYHCNGQ      Green        Hydroxyl + Amine + Basic -
Others        Gray


Multiple sequence alignment: Software: ClustalW2: Outputs 2

Overview trees:

- Phylogram tree with/without distance data: shows the proportional length of branches

- Cladogram tree with/without distance data: branch distances are equalized at leaf elements to eliminate graphically disturbing long branch effects

- Tree structure as standard DND script (Newick format):

(((S.meliloti:0.09650,R.sp:0.07189):0.06304,A.ehrlichei:0.13385):0.00826,(P.stutzeri:0.09343,P.aerugi:0.13196):0.00761,M.algicola:0.14653);
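The DND script above can be read programmatically. The following Python sketch is a minimal recursive parser for this kind of tree text, assuming no quoted names, comments or whitespace; the names `parse_newick` and `leaf_depths` are hypothetical. It sums branch lengths from the root to every leaf.

```python
def parse_newick(text):
    """Parse a tree like '((A:1,B:2):0.5,C:3);' into nested dicts."""
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                       # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                   # skip ","
                children.append(node())
            pos += 1                       # skip ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = 0.0
        if text[pos] == ":":
            pos += 1                       # skip ":"
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()


def leaf_depths(tree, depth=0.0, out=None):
    """Sum of branch lengths from the root to each leaf."""
    out = {} if out is None else out
    if not tree["children"]:
        out[tree["name"]] = depth + tree["length"]
    for child in tree["children"]:
        leaf_depths(child, depth + tree["length"], out)
    return out


dnd = ("(((S.meliloti:0.09650,R.sp:0.07189):0.06304,A.ehrlichei:0.13385)"
       ":0.00826,(P.stutzeri:0.09343,P.aerugi:0.13196):0.00761,"
       "M.algicola:0.14653);")
depths = leaf_depths(parse_newick(dnd))
for name, depth in depths.items():
    print(name, round(depth, 5))
```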

References

Multiple sequence alignment:
- http://en.wikipedia.org/wiki/Multiple_sequence_alignment
- Links: http://pbil.univ-lyon1.fr/alignment.html

Clustering methods:
- http://en.wikipedia.org/wiki/Cluster_analysis
- www.iis.sinica.edu.tw/~hil/summer/sorin/sorin2-2.ppt
- http://www.technion.ac.il/docs/sas/stat/chap42/sect11.htm

Phylogenetic tree analysis:
- http://en.wikipedia.org/wiki/Phylogenetic_tree
- http://www.cs.huji.ac.il/course/2006/cbio/Scribes/lect13-yaar/lect13.pdf

Phylogenetic software:
- EBI ClustalW: http://www.ebi.ac.uk/Tools/clustalw2/index.html
- BCM: http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html
- STRAP: http://www.bioinformatics.org/strap/

Date post:	02-Jan-2016