1 Multiple sequence alignment Lesson 3. 2 1. What is a multiple sequence alignment?


Multiple sequence alignment

Lesson 3

1. What is a multiple sequence alignment?

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Similar to pairwise alignment, BUT n sequences are aligned instead of just 2.

Multiple sequence alignment

MSA = Multiple Sequence Alignment
Each row represents an individual sequence
Each column represents the ‘same’ position

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Multiple sequence alignment

Homo sapiens

Pan troglodytes

Mus musculus

Canis familiaris

Gallus gallus

Anopheles gambiae

Drosophila melanogaster

Caenorhabditis elegans

Arabidopsis thaliana

Rattus norvegicus

Histone H4 protein

NADH dehydrogenase subunit 4

Histone H4 protein 4

►Which is better: a pairwise alignment, or a pair of rows taken from an MSA?

2. How MSAs are computed

Alignment – Dynamic Programming

There is a dynamic programming algorithm for multiple sequences, similar to the one for pairwise alignment.

Complexity:

O(n^|sequences|), where n is the sequence length and |sequences| is the number of sequences
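To get a feel for why the exact algorithm is impractical, here is a minimal sketch that counts the cells of the dynamic programming table, assuming all sequences have the same length. The function name `dp_cells` is made up for illustration, and the count ignores the extra work needed per cell.

```python
def dp_cells(seq_len: int, n_seqs: int) -> int:
    """Cells in the full DP lattice: one axis of (length + 1) per sequence."""
    return (seq_len + 1) ** n_seqs


# Table size for sequences of length 100 as more sequences are added.
for n_seqs in (2, 3, 5, 10):
    print(n_seqs, dp_cells(100, n_seqs))
```

Two sequences of length 100 need about ten thousand cells, three need about a million, and ten need roughly 10^20, which is why heuristics are used instead.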

Alignment methods

This complexity is not practical, therefore heuristics are used:

• Progressive/hierarchical alignment (Clustal)

• Iterative alignment (MAFFT, MUSCLE)

Progressive alignment

First step:

Compute the pairwise alignments for all against all (6 pairwise alignments). The similarities are converted to distances and stored in a table.

     A    B    C    D
D   16   14   10
E   32   31   31   32

Second step:

Cluster the sequences to create a tree (guide tree):
• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree

[Figure: guide tree with leaves A, B, C, D, E]

The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
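The clustering step can be sketched as follows. This is only an illustration under stated assumptions: it uses simple average-linkage clustering (the slides do not say which clustering method is used), the D and E distances are read from the table fragments, and the A-B, A-C and B-C distances are invented because those rows are not in the text.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
DIST = {
    frozenset("AB"): 8, frozenset("AC"): 15, frozenset("BC"): 17,   # hypothetical values
    frozenset("AD"): 16, frozenset("BD"): 14, frozenset("CD"): 10,  # D row of the table
    frozenset("AE"): 32, frozenset("BE"): 31, frozenset("CE"): 31,  # E row of the table
    frozenset("DE"): 32,
}


def average_distance(leaves_a, leaves_b, dist):
    """Mean distance over all leaf pairs drawn from the two clusters."""
    total = sum(dist[frozenset((a, b))] for a in leaves_a for b in leaves_b)
    return total / (len(leaves_a) * len(leaves_b))


def guide_tree(names, dist):
    """Repeatedly merge the two closest clusters; returns a nested tuple."""
    clusters = {name: [name] for name in names}  # subtree -> its leaves
    while len(clusters) > 1:
        t1, t2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(clusters[pair[0]], clusters[pair[1]], dist),
        )
        clusters[(t1, t2)] = clusters.pop(t1) + clusters.pop(t2)
    return next(iter(clusters))


print(guide_tree(NAMES, DIST))
```

With these numbers A and B are joined first, then C and D, then the two pairs, and E is added last. The nesting of the returned tuple is the order in which the alignments would be performed.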

Third step:

1. Align the most similar (neighboring) pairs (sequence against sequence)

2. Align pairs of pairs (profile against profile)

3. Align the remaining sequence (E) against the profile

Main disadvantages:

• Sub-optimal tree topology

• Misalignments resulting from globally aligning pairs of sequences.

Iterative alignment

Guide tree

Pairwise distance table

Iterate until the MSA does not change (convergence)

3. MSA – What is it good for?

A. Conserved positions

B. Consensus

C. Patterns

D. Profiles

E. Much more…

Consensus sequence

ATCTTGT

AACTTGT

AACTTCT

AACTTGT

A consensus sequence holds the most frequent character of the alignment at each column
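A minimal sketch of this rule, applied to the four lines shown above. Ties are broken alphabetically, which is the convention this lesson gives for the strict-majority consensus; the function name `consensus` is just illustrative.

```python
from collections import Counter


def consensus(alignment):
    """Most frequent character per column; ties go to the alphabetically first."""
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        best = max(counts.values())
        result.append(min(ch for ch, c in counts.items() if c == best))
    return "".join(result)


rows = ["ATCTTGT", "AACTTGT", "AACTTCT", "AACTTGT"]
print(consensus(rows))  # AACTTGT
```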

Consensus sequence – an example

TACGAT
TATAAT
GATACT
TATGAT
TATGTT

The -10 region of six promoters. There are many variants to the “consensus.”

1. Strict majority: TATAAT
* In case of equal frequencies, choose one according to alphabetical order.

Had we searched the region upstream of genes for this consensus, we would have identified only 2 out of the 6 sequences. So we will miss many cases.

By chance, we expect a “hit” every 4,096 bp.

We can search while allowing 1 mismatch.

We would have identified 3 out of the 6 sequences, so we will miss fewer cases.

By chance, we expect a “hit” every ~200 bp → more “noise”.

We can search while allowing 2 mismatches.

We would have identified all 6 sequences, so we won’t miss any.

By chance, we expect a “hit” every ~30 bp → A LOT OF “noise”.
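The numbers on these slides can be checked with a short sketch. It assumes a uniform random background (each base with probability 1/4) and assumes that the sixth promoter, which is not listed in the text, is a second copy of TATAAT; that assumption is what makes the exact search find 2 of 6.

```python
from math import comb

CONSENSUS = "TATAAT"
# The second TATAAT is an assumption: only five distinct sequences are listed.
PROMOTERS = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def hits(max_mismatches):
    """How many promoters lie within max_mismatches of the consensus."""
    return sum(mismatches(s, CONSENSUS) <= max_mismatches for s in PROMOTERS)


def expected_spacing(length, max_mismatches, alphabet=4):
    """Average distance between chance hits in random sequence."""
    matching = sum(
        comb(length, k) * (alphabet - 1) ** k for k in range(max_mismatches + 1)
    )
    return alphabet ** length / matching


for k in range(3):
    print(k, hits(k), round(expected_spacing(6, k)))
```

This gives 2, 3 and 6 promoters found for 0, 1 and 2 mismatches, with a chance hit every 4,096, about 216 and about 27 bp, in line with the rounded figures on the slides.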

2. Majority only when it is a clear case. In the remaining cases, use wildcards: TATRNT

Y = Pyrimidine
R = Purine
N = Any nucleotide

Reminder: Purines & Pyrimidines

Y = Pyrimidine (C or T)
R = Purine (A or G)
N = Any nucleotide (A, C, G or T)

Had we searched the region upstream of genes with the redundant consensus, we would have identified 4/6 sequences.

By chance, we expect a “hit” every ~500 bp.
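A sketch of searching with the wildcard consensus: each IUPAC letter is turned into a regular-expression character class. Only the three wildcard codes used in this lesson are included, and the sixth promoter is again assumed to be a second TATAAT, since it is not listed in the text.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT", "N": "ACGT"}


def to_regex(consensus):
    """Turn an IUPAC consensus into a regular expression, one class per position."""
    return "".join(f"[{IUPAC[c]}]" for c in consensus)


def degenerate_spacing(consensus):
    """Average distance between chance hits, assuming uniform random bases."""
    matching = 1
    for c in consensus:
        matching *= len(IUPAC[c])
    return 4 ** len(consensus) / matching


# The second TATAAT is an assumption (the text lists only five sequences).
promoters = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
pattern = re.compile(to_regex("TATRNT"))
found = [s for s in promoters if pattern.fullmatch(s)]
print(len(found), degenerate_spacing("TATRNT"))
```

TATRNT stands for 8 different 6-mers, so a chance hit is expected every 4096 / 8 = 512 bp, the "~500 bp" quoted above.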


There is always a tradeoff between sensitivity and specificity.
Sensitivity: the fraction of actual positives that are predicted as positive, TP / (TP + FN).
Specificity: the fraction of actual negatives that are predicted as negative, TN / (TN + FP).

TATRNT (permissive) vs. TATAAT (nonpermissive)

Permissive consensus: higher sensitivity, lower specificity (more true positives, more false positives ↔ fewer true negatives, fewer false negatives).
Nonpermissive consensus: higher specificity, lower sensitivity (fewer true positives, fewer false positives ↔ more true negatives, more false negatives).

Patterns

TACGAT

TATAAT

GATACT

TATGAT

TATGTT

[TG]-A-[TC]-[GA]-[CTA]-T

Patterns are more informative than consensus sequences.

A pattern specifies, for each position, the characters that are possible at that position.

Patterns - syntax

• The standard IUPAC one-letter codes.
• ‘x’ : any amino acid.
• ‘[]’ : residues allowed at the position.
• ‘{}’ : residues forbidden at the position.
• ‘()’ : repetition of a pattern element is indicated in parentheses: x(n) or x(n,m) gives the number or range of repetitions.
• ‘-’ : separates each pattern element.
• ‘<’ : indicates an N-terminal restriction of the pattern.
• ‘>’ : indicates a C-terminal restriction of the pattern.
• ‘.’ : the period ends the pattern.

Example:
• W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]

Patterns

x(9,11): any amino acid, repeated between 9 and 11 times

[FYV]: F or Y or V

WOPLASDFGYVWPPPLAWS
ROPLASDFGYVWPPPLAWS
WOPLASDFGYVWPPPLSQQQ
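The pattern syntax maps almost directly onto regular expressions. The converter below is a rough sketch covering only the elements listed in this lesson; `prosite_to_regex` is a made-up helper name, not a function from any library. It is tried on the example pattern and the three test sequences shown above.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (9,11) or (3).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{"):       # forbidden residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]")
sequences = ["WOPLASDFGYVWPPPLAWS", "ROPLASDFGYVWPPPLAWS", "WOPLASDFGYVWPPPLSQQQ"]
matches = [bool(re.search(regex, s)) for s in sequences]
print(regex, matches)
```

Only the first sequence matches: the second has no W far enough from its end to start the pattern, and the third has Q where one of G, S, T, N, E is required.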

Profile = PSSM = Position Specific Score Matrix

ACCCAA
AACCGG
AACCTT

      1     2     3     4     5     6
A     1    .67    0     0    .33   .33
C     0    .33    1     1     0     0
G     0     0     0     0    .33   .33
T     0     0     0     0    .33   .33

P(AACCAA) = 1 × 0.67 × 1 × 1 × 0.33 × 0.33 ≈ 0.073
P(GACCAA) = 0

Sequences with higher probabilities → higher chance of being related to the PSSM.
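As a sketch, the matrix can be built from the three aligned sequences by counting column frequencies, and a 6-mer is then scored by multiplying one entry per position. The function names are illustrative; exact fractions (2/3, 1/3) are used where the slide shows the rounded .67 and .33.

```python
def build_pssm(alignment, alphabet="ACGT"):
    """One dict of character frequencies per alignment column."""
    n = len(alignment)
    return [
        {ch: column.count(ch) / n for ch in alphabet}
        for column in zip(*alignment)
    ]


def probability(kmer, pssm):
    """Product of the per-position frequencies of the k-mer's characters."""
    p = 1.0
    for ch, column in zip(kmer, pssm):
        p *= column[ch]
    return p


pssm = build_pssm(["ACCCAA", "AACCGG", "AACCTT"])
print(probability("AACCAA", pssm))  # about 0.074
print(probability("GACCAA", pssm))  # 0.0, because G never occurs at position 1
```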


Profiles / PSSMs

One compares each n-mer to the profile and computes the probabilities. Sequences with probabilities > threshold are considered as hits.

Searching with PSSM

GACGGTACGTAGCGGAGCGACCAA

Compute the probability of the first 6-mer (P1), then move the window along the sequence one position at a time and compute the probability of each following 6-mer.

6-mers with probabilities > threshold are considered as hits.
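The sliding-window search can be sketched as below. The profile is rebuilt from the three sequences of the PSSM example; the second test sequence and the threshold of 0.05 are hypothetical, chosen only to show a hit.

```python
ALIGNMENT = ["ACCCAA", "AACCGG", "AACCTT"]
PSSM = [
    {ch: col.count(ch) / len(col) for ch in "ACGT"}
    for col in zip(*ALIGNMENT)
]


def scan(sequence, pssm, threshold):
    """Return (start, window, probability) for every window above threshold."""
    width = len(pssm)
    found = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        p = 1.0
        for ch, column in zip(window, pssm):
            p *= column[ch]
        if p > threshold:
            found.append((start, window, p))
    return found


print(scan("GACGGTACGTAGCGGAGCGACCAA", PSSM, 0.0))  # no window scores above 0
print(scan("TTAACCAGTT", PSSM, 0.05))               # hypothetical sequence, one hit
```

On the sequence from the slide every window gets probability 0 with this matrix (the last window, GACCAA, starts with G). This is one reason practical profiles usually add small pseudocounts instead of using raw frequencies.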

Profile-pattern-consensus

multiple alignment:
AACTTG
AAGTCG
CACTTC

profile:
      1      2      3      4
A    0.66    1      0      0     ...
T     0      0      0      1     ...
C    0.33    0     0.66    0     ...
G     0      0     0.33    0     ...

consensus:
AACTTG
NANTNN

pattern:
[AC]-A-[GC]-T-[TC]-[GC]

4. HMM: Hidden Markov Models

Definitions & Uses

• A probabilistic model which deals with sequences of symbols. Uses: inferring hidden states.

• Originally used in speech recognition (the symbols being phonemes).

• Useful in biology, where the sequence of symbols is the DNA or protein sequence.

Markov Chains

• A sequence of random variables X1, X2, … where each present state depends only on the previous state.

• Weather example: the weather on day x depends only on day x-1.

• We can easily compute the probability of: Sunny Sunny Rainy Sunny Sunny

Markov Chains

• Similarly, we can assume a DNA sequence is Markovian.
• ACGGTA… (vertical or horizontal!)
• These conditional probabilities can be illustrated as follows (in DNA):

• Each arrow has a transition probability: P_{CA} = P(x_i = A | x_{i-1} = C)

• Thus, the probability of a sequence x of length L will be:

P(x) = P(x_L, x_{L-1}, \ldots, x_1) = P(x_1) \prod_{i=2}^{L} P_{x_{i-1} x_i}
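For the weather example, the product of transition probabilities can be sketched like this. All numbers are hypothetical, since the probabilities of the weather diagram are not given in the text.

```python
# Hypothetical probabilities, for illustration only.
INITIAL = {"Sunny": 0.5, "Rainy": 0.5}
TRANSITION = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}


def chain_probability(states):
    """P(first state) times the product of all consecutive transitions."""
    p = INITIAL[states[0]]
    for previous, current in zip(states, states[1:]):
        p *= TRANSITION[previous][current]
    return p


days = ["Sunny", "Sunny", "Rainy", "Sunny", "Sunny"]
print(chain_probability(days))  # 0.5 * 0.8 * 0.2 * 0.4 * 0.8, about 0.0256
```

The same function works for a DNA sequence if the states are A, C, G, T and the table holds the 16 transition probabilities.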

Hidden Markov Models

• The state sequence itself follows a simple Markov chain.

• But in an HMM it is no longer possible to know the state by looking at the symbols: the state is hidden.

[Figure: hidden states S(i-1) → S(i) → S(i+1), each emitting an observed symbol K(i-1), K(i), K(i+1)]

The weather HMM example

• In this weather example only the actions are observable and the weather is hidden:

HMM formalities

• {S, K, Π, P, B}

• S : {s1…sN} are the values for the hidden states

• K : {k1…kM} are the values for the observations

• The hidden states emit/generate the symbols (observations)

• Π = {Πi} are the initial state probabilities

• P = {Pij} are the state transition probabilities

• B = {bik} are the emission probabilities

Another HMM example – the dishonest casino

• In a casino, they use a fair die most of the time, but occasionally switch to an unfair die. The switch between dice can be represented by an HMM:

FAIR (stays FAIR with probability 0.95)
1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6

UNFAIR (stays UNFAIR with probability 0.9)
1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2

Dishonest casino - continued

• The symbols (observations) are the sequence of rolls:

3 5 6 2 1 4 6 3 6…

• What is hidden?

If the die is fair or unfair:

f f f f u u u f f

This is a Markov chain.

In addition, we have:

• Emission probabilities:

Given a state, we have 6 possible matching symbols, each with an emission probability.
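One standard way to infer the hidden fair/unfair sequence from the rolls is the Viterbi algorithm, which is not named in these slides. The sketch below uses the emission probabilities above, reads 0.95 and 0.9 as the probabilities of staying with the fair and unfair die, takes the switching probabilities as the complements (0.05 and 0.1), and assumes equal initial probabilities, which the slides do not give.

```python
import math

STATES = ("F", "U")            # fair, unfair
START = {"F": 0.5, "U": 0.5}   # assumption: not given on the slides
TRANS = {"F": {"F": 0.95, "U": 0.05}, "U": {"F": 0.10, "U": 0.90}}
EMIT = {
    "F": {roll: 1 / 6 for roll in range(1, 7)},
    "U": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def viterbi(rolls):
    """Most probable hidden state path, computed in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][rolls[0]]) for s in STATES}
    back = []
    for roll in rolls[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][roll])
            )
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # follow the back-pointers to the start
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi([3, 5, 6, 2, 1, 4, 6, 3, 6]))
print(viterbi([6] * 20))
```

Under these assumptions the nine rolls from the slide are explained best by the fair die throughout, while a long run of sixes is assigned to the unfair die.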

HMM of MSA

• MSA can be represented by an HMM

– Insertion of A/C/G/T

– Match or Mismatch

– Deletion


HMM of MSA can get more complex…

Questions where HMMs are used:

• Does this sequence belong to a particular family?

• Can we identify regions in a sequence (for instance, alpha helices, beta sheets)?

• Pairwise/multiple sequence alignment

• Searching databases for protein families (building profiles).
