Multiple Sequence Alignment School of B&I TCD May 2010.

Date post: 28-Jan-2016
Upload: oliver-carter

Multiple Sequence Alignment

School of B&I TCD May 2010

MSA

• A central technique in bioinformatics
  – homology searching
  – multiple sequence alignment
  – phylogenetic trees

An example

“all you have to do” is re-write your sequences so that similar features finish up in the same columns

Evolutionary relationship

• “Similar features” ideally means homologous: descended from a shared ancestor

• ClustalW and T-Coffee mimic the process of evolution
  – by weighting similar residues by how conserved they are in evolution
    • Important AAs rarely mutate
    • Less important AAs change easily, even randomly
  – by inserting judicious gaps

Applications

• Discover conserved patterns/motifs
  – A step to describing a protein domain
  – MSA can add a distant relative to your protein family
• To define DNA regulatory elements
• Prediction of secondary structure, and helps with 3-D structure
• A step to phylogenetic trees
• PCR analysis/primer design
  – find most and least degenerate regions of your sequence

So why difficult?

Trivial 2 seq alignment: 3 possibilities. As length and # of seqs increase, the number of possible alignments becomes astronomical.

FGDERTHHS
FGD--DHRS

FGDERTHHS
FGDD--HRS

FGDERTHHS
FGD-D-HRS

Where to put the gap?
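
To get a feel for how quickly the search space grows, here is a minimal sketch that counts the possible global alignments of two sequences. It assumes each alignment column is either a residue pair, a gap in the first sequence, or a gap in the second; that counting convention and the function name `count_alignments` are illustrative assumptions, not something the slides specify.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Count global alignments of two sequences of length m and n."""
    if m == 0 or n == 0:
        return 1  # nothing left on one side: the rest must be all gaps
    return (
        count_alignments(m - 1, n - 1)  # last column pairs two residues
        + count_alignments(m - 1, n)    # last column: gap in second sequence
        + count_alignments(m, n - 1)    # last column: gap in first sequence
    )


for length in (1, 2, 3, 10, 20):
    print(length, count_alignments(length, length))
```

Under this convention two single residues already have 3 alignments, and the count grows roughly exponentially with length, which is why exhaustive search is not an option even before a third sequence is added.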

Some data

• Cat ATGAAACGTCGGATCTAA
• Dog ATGAATCGACCCATCTAA
• Mus ATGGCGTGGCTTGGCATGTGA
• Rat ATGGCATGTCGTGGCATGTAG

Protocol step 1
• Align each pair of seqs: C-D, C-M, C-R etc.
• Get a score for each alignment
• And make a …

Similarity matrix

      Cat   Dog   Mus   Rat
Cat   ID    14    10    10
Dog         ID    10    10
Mus               ID    16
Rat                     ID

• Number of identical residues
  – Which pair of sequences is most similar?
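
As a rough illustration of how such scores can be counted, the sketch below tallies identical residues between already-gapped rows. It reuses the gapped rows from the "An alignment" slide of this deck rather than running separate pairwise alignments, so the Cat/Mus-type scores it prints need not match the matrix above; the helper name `count_identical` is an assumption.

```python
from itertools import combinations

# Gapped rows copied from the "An alignment" slide of this deck.
aligned = {
    "Cat": "ATGAAACGTCGG---ATCTAA",
    "Dog": "ATGAATCGACCC---ATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
}


def count_identical(a: str, b: str) -> int:
    """Count columns where both rows hold the same residue (gaps never count)."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


identity = {
    (p, q): count_identical(aligned[p], aligned[q])
    for p, q in combinations(aligned, 2)
}
for pair, score in identity.items():
    print(pair, score)
```

The Cat/Dog and Mus/Rat counts come out at 14 and 16, in line with the matrix, so Mus/Rat is the most similar pair.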

Progressive alignment

• Align the two most similar sequences, inserting any gaps.

• Mus/Rat: lock these sequences together (call it “RODent”)

• Return to the similarity matrix to find the next most similar seqs or sequence cluster

• Dog/Cat: align and lock (call it “CARnivore”)
  – if the next step requires a gap, then the gap is inserted in both carnivore sequences

• Align next most … (now it's iterative)
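
The order in which sequences get locked together can be sketched as a greedy clustering over the similarity matrix. This is a minimal sketch, not ClustalW's actual guide-tree method: scoring a cluster by the average of its pairwise similarities is an assumed rule, and the names `cluster_score` and `merge_order` are hypothetical. It only decides the order; it does not build the alignment itself.

```python
# Similarity scores from the "Similarity matrix" slide (identical residues).
similarity = {
    frozenset(("Cat", "Dog")): 14,
    frozenset(("Cat", "Mus")): 10,
    frozenset(("Cat", "Rat")): 10,
    frozenset(("Dog", "Mus")): 10,
    frozenset(("Dog", "Rat")): 10,
    frozenset(("Mus", "Rat")): 16,
}


def cluster_score(a: tuple, b: tuple) -> float:
    """Average pairwise similarity between two clusters (assumed linkage rule)."""
    scores = [similarity[frozenset((x, y))] for x in a for y in b]
    return sum(scores) / len(scores)


def merge_order(names):
    """Return the order in which clusters would be aligned and locked."""
    clusters = [(name,) for name in names]
    order = []
    while len(clusters) > 1:
        # Pick the most similar pair among the clusters not yet merged.
        a, b = max(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda pair: cluster_score(*pair),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        order.append((a, b))
    return order


steps = merge_order(["Cat", "Dog", "Mus", "Rat"])
for a, b in steps:
    print("align", a, "with", b)
```

With the slide's scores this joins Mus with Rat first, then Cat with Dog, and finally the rodent cluster with the carnivore cluster.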

An alignment

Cat ATGAAACGTCGG---ATCTAA
Dog ATGAATCGACCC---ATCTAA
Mus ATGGCGTGGCTTGGCATGTGA
Rat ATGGCATGTCGTGGCATGTAG
    ***    * *     ** *

• Good: Always a two “sequence” problem
  – So computationally possible

• Bad: Can’t rewrite or decouple (part of) the dog/cat alignment in the light of later info. Locked in a (suboptimal?) trough.
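
The row of asterisks under the alignment marks columns where all four sequences agree. A minimal sketch of how that line can be derived from the four rows (the function name `conservation_line` is an assumption):

```python
rows = [
    "ATGAAACGTCGG---ATCTAA",  # Cat
    "ATGAATCGACCC---ATCTAA",  # Dog
    "ATGGCGTGGCTTGGCATGTGA",  # Mus
    "ATGGCATGTCGTGGCATGTAG",  # Rat
]


def conservation_line(rows):
    """Mark a column with '*' when every row holds the same non-gap residue."""
    marks = []
    for column in zip(*rows):
        same = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if same else " ")
    return "".join(marks)


line = conservation_line(rows)
print(line)
```

For these rows, eight columns are fully conserved.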

Choosing the right seqs

• Use MSA to inform you!
• Always use AA/protein if possible
  – can copy gaps back to DNA later
• Start with 6-15 sequences
• Eliminate very different (<30% id) seqs
• Eliminate identical sequences
• Watch out for partial sequences
• …or sequences that need ++ gaps to align
• Check for repeats with dotlet, Lalign
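
Two of these rules (drop identical sequences, drop those below 30% identity) can be sketched as a filter over rows from a first-pass alignment. This is only an illustration under stated assumptions: percent identity is measured against one chosen reference row and only over columns where neither row has a gap (other definitions exist), and the names `percent_identity` and `select_sequences` and the toy data are made up.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def select_sequences(aligned: dict, reference: str, min_identity: float = 30.0) -> dict:
    """Keep rows that are neither duplicates nor too distant from the reference."""
    kept = {}
    seen = set()
    for name, row in aligned.items():
        ungapped = row.replace("-", "")
        if ungapped in seen:
            continue  # identical to a sequence already kept
        if name != reference and percent_identity(row, aligned[reference]) < min_identity:
            continue  # too different to align reliably
        seen.add(ungapped)
        kept[name] = row
    return kept


# Toy rows, not real data.
toy = {
    "ref": "ACGTACGTAC",
    "dup": "ACGTACGTAC",
    "near": "ACGTACGTAA",
    "far": "TTTTTTTTTT",
}
kept = select_sequences(toy, "ref")
print(list(kept))
```

A filter like this is a starting point only; partial sequences and rows needing many gaps still have to be spotted by eye.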

Less is more

• Large alignments
  – take ++ CPU and time
  – are hard to do well
  – are difficult to display
  – are difficult to use: in trees for example
  – may include marginal seqs that wreck the whole alignment

• So start small and add/eliminate seqs until you have a clear informative picture

Level of variation is important

• Choose the sequence family with the best rate of evolution for your taxonomic group
  – Histones evolve very slowly (compare kingdoms)
  – Transferrins are fast (compare classes, orders)

• Closely related sequences may have identical protein (but variable DNA)

• Distantly related sequences: no DNA signal (“saturated”)

Comparing related sequences

• Case 1, human vs chimp

Seq1 A C G T A A A A G C
     |   | | | | | | | |
Seq2 A A G T A A A A G C

• How many changes? D=0.1 d=?

• Case 2, aardvark vs human

Seq1 A C G T A A A A G C
     | |         |
Seq2 A C A C G G A T A G

• How many changes? D=0.7 d=?

• Need to compensate for multiple hits.

G  100mya  G
G   90mya  G
G   70mya  C
A   50mya  C
C   30mya  C
C   10mya  G
A   now    G
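
The slides leave d open. One widely used way to compensate for multiple hits is the Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) D), which assumes every substitution is equally likely. Picking that model here is an assumption (the slides do not name one), as is the function name `jukes_cantor`.

```python
import math


def jukes_cantor(p: float) -> float:
    """Estimate substitutions per site from the observed proportion of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - (4 / 3) * p)


for observed in (0.1, 0.7):
    print(observed, round(jukes_cantor(observed), 3))
```

Under this model D=0.1 corrects only slightly (about 0.107), whereas D=0.7 corrects to roughly 2.03 substitutions per site: most changes between distant sequences are hidden.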

Multiple substitution

Ancestor: G

[Figure: what really happened along each pair of lineages, compared with what diffs we can see]

• Descendants G and C: 1 seen
• Descendants A and A: 0 seen
• Descendants A and C: 1 seen

Greater distance – more likely multiple substitution

EBI: loads of options

T-coffee

Minimal input parameters and STILL a better job than ClustalW

Output EBI clustalW

• Pairwise distance etc.
• Alignment
• Guide tree
• What you submitted
• Jalview alignment editor

An alignment fragment

ACT_CANAL   -MDGEEVAALIIDNGSGMCKA
ACT_CANDU   -MDGEEVAALVIDNGSGMCKA
ACT_PICAN   -MDGEDVAALVIDNGSGMCKA
ACT_PICPA   -MDGEDVAALVIDNGSGMCKA
ACT_KLULA   -MDS-EVAALVIDNGSGMCKA
ACT_YEAST   -MDS-EVAALVIDNGSGMCKA
ACT_YARLI   -MED-ETVALVIDNGSGMCKA
ACT2_ABSGL  MSMEEDIAALVIDNASGMCKA
ACT2_SCHCO  --MDDEIQAVVIDNGSGMCKA
                 :  *:::**.******

*  All AA in column identical
:  AA similar size & hydrophobicity
.  AA similar size or hydrophobicity

ClustalW format

The alignment, so what next?

• Look at it very closely

• Hand edit if necessary (probably)

• Eliminate problem sequences and redo?

• Use the display option best for the next step
  – Phylip format for trees
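
For orientation, here is a minimal sketch of what strict sequential PHYLIP output looks like: a header with the number of sequences and columns, then one line per sequence with the name padded to 10 characters. The function name `to_phylip` is an assumption, and relaxed or interleaved variants exist, so check what your tree program expects; in practice the alignment server or editor can usually export this format directly.

```python
def to_phylip(alignment: dict) -> str:
    """Format aligned sequences as strict sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = len(next(iter(alignment.values())))
    lines = [f" {len(alignment)} {n_cols}"]
    for name, seq in alignment.items():
        # Strict PHYLIP reserves exactly 10 characters for the name.
        label = name[:10].ljust(10)
        lines.append(label + seq)
    return "\n".join(lines) + "\n"


text = to_phylip({
    "Cat": "ATGAAACGTCGG---ATCTAA",
    "Dog": "ATGAATCGACCC---ATCTAA",
    "Mus": "ATGGCGTGGCTTGGCATGTGA",
    "Rat": "ATGGCATGTCGTGGCATGTAG",
})
print(text)
```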

Parameter changes

• Substitution matrix: PAM, Gonnet, Blosum
  – ClustalW chooses which matrix within the family
    • PAM30 for closely related pairs; PAM120; PAM250 for more distant
  – Difficult alignment: a matrix change may help

• Gap penalties (open and extend) have optimal values for each family: find them by trial and error.
  – ClustalW puts gaps (which are often external loops) near previous gaps (longer loop)

• MSA does the grunt work. YOU do the fine tuning.

Alignment display: weblogo

Always remember: sequence represents a 3-D structure

Patterns to recognise (more reliable in MSA than in single seq)

• Alternate hydrophobic residues
  – Surface β-sheet (zig-zag-zig-zag)

• Runs of hydrophobic residues
  – Interior/buried β-sheet

• Residues with 3.5 AA spacing (amphipathic)
  – α-helix WNNWFNNFNNWNNNF

• Gaps/indels
  – Probably surface not core

MSA improves secondary structure (α-helix, β-sheet) prediction by >6%

Conserved residues

• W, F, Y: large hydrophobic, internal/core
  – conserved WFY is the best signal for domains

• G, P: turns, can mark the end of an α-helix or β-sheet

• C: conserved with reliable spacing suggests C-C disulphide bridges (e.g. defensins)

• H, S: often catalytic sites in proteases (and other enzymes)

• K, R, D, E: charged; ligand binding or salt-bridge

• L: very common AA but not conserved
  – except in the leucine zipper L234567L234567L234567L
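
The leucine-zipper spacing above (an L at every seventh position, four times over) maps directly onto a regular expression. A minimal sketch follows; the function name `find_leucine_heptads` and the demo sequence are made up, and a match is only a hint, since many sequences fit the spacing without forming a real zipper.

```python
import re

# L, any six residues, L, any six, L, any six, L.
# The lookahead lets overlapping windows be reported too.
LEUCINE_ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_heptads(sequence: str):
    """Return (start, window) for every window fitting L-x6-L-x6-L-x6-L."""
    return [(m.start(), m.group(1)) for m in LEUCINE_ZIPPER.finditer(sequence.upper())]


# Made-up test sequence, not a real protein.
demo = "MK" + "LAAAAAA" * 3 + "LKR"
hits = find_leucine_heptads(demo)
print(hits)
```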

Finish with an alignment: defensins

3 pairs of C residues: 3 disulphide bridges

