Algorithms for Alignment of Genomic Sequences
Michael Brudno
Department of Computer Science
Stanford University
PGA Workshop 07/16/2004
Conservation Implies Function
Exon
Gene
CNS: Other Conserved
Edit Distance Model (1)
Weighted sum of insertions, deletions & mutations to transform one string into another
AGGCACA--CA
|  ||||  ||
A--CACATTCA
or
AGGCACACA
|  ||  ||
ACACATTCA
Edit Distance Model (2)
Given: x, y
Define: F(i,j) = Score of best alignment of x1…xi to y1…yj
Recurrence: F(i,j) = max ( F(i-1,j) - GAP_PENALTY,
                           F(i,j-1) - GAP_PENALTY,
                           F(i-1,j-1) + SCORE(xi, yj) )
Edit Distance Model (3)
F(i,j) = Score of best alignment ending at i,j
Time O(n^2) for two seqs, O(n^k) for k seqs
[Figure: DP matrix cell F(i,j) shown with its neighbors F(i-1,j), F(i,j-1), and F(i-1,j-1)]
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
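The recurrence for F(i,j) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the talk: the function name and the default values for the gap penalty and the match/mismatch scores are assumptions chosen for the example.

```python
def global_alignment_score(x, y, gap_penalty=1, match=1, mismatch=-1):
    """Fill the F(i, j) table and return F(len(x), len(y)).

    The scoring values are illustrative assumptions, not values from the talk.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, n + 1):
        F[i][0] = -i * gap_penalty
    for j in range(1, m + 1):
        F[0][j] = -j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j] - gap_penalty,   # x_i aligned to a gap
                F[i][j - 1] - gap_penalty,   # y_j aligned to a gap
                F[i - 1][j - 1] + score,     # x_i aligned to y_j
            )
    return F[n][m]
```

The two nested loops visit every cell once, which is where the O(n^2) time for two sequences comes from.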
Overview
• Local Alignment (CHAOS)
• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment
• Glocal Alignment (Shuffle-LAGAN)
• Biological Story
Local Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
F(i,j) = max (F(i,j), 0)
Return all paths with a position i,j where F(i,j) > C
Time O(n^2) for two seqs, O(n^k) for k seqs
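A minimal sketch of the local variant: the same table is filled, but every cell is clamped at 0, and cells scoring above the cutoff C are reported. The function name, scoring defaults, and the choice to return cell coordinates rather than traced-back paths are assumptions made to keep the example short.

```python
def local_alignment_hits(x, y, threshold, gap_penalty=1, match=1, mismatch=-1):
    """Return (i, j, score) for every cell with F(i, j) > threshold.

    Scoring values are illustrative; traceback of the paths is omitted.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    hits = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                           # restart the alignment here
                F[i - 1][j] - gap_penalty,
                F[i][j - 1] - gap_penalty,
                F[i - 1][j - 1] + score,
            )
            if F[i][j] > threshold:
                hits.append((i, j, F[i][j]))
    return hits
```

For example, `local_alignment_hits("GGACGT", "CCACGTAA", 3)` reports the single cell where the shared run `ACGT` ends.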
Heuristic Local Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
[Figure: two panels over this sequence pair, labeled BLAST and FASTA]
CHAOS: CHAins Of Seeds
1. Find short matching words (seeds)
2. Chain them
3. Rescore chain
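Step 1 can be illustrated with a simple k-mer index. This sketch assumes exact-match seeds of a fixed length k; the function name and the dictionary-based index are choices made for the example and may differ from how CHAOS itself finds seeds.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, k):
    """Return (pos1, pos2) pairs where the k-mers starting there are identical.

    Assumes exact matches only; the result is ordered by position in seq1.
    """
    # Index every k-mer of seq2 by its start positions.
    index = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)
    # Walk along seq1 and look each k-mer up in the index.
    seeds = []
    for i in range(len(seq1) - k + 1):
        for j in index.get(seq1[i:i + k], []):
            seeds.append((i, j))
    return seeds
```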
CHAOS: Chaining the Seeds
• Find seeds at current location in seq1
• Find the previous seeds that fall into the search box
• Do a range query: seeds are indexed by their diagonal
• Pick a previous seed that maximizes the score of chain
[Figure: seq1 vs. seq2 plot showing the location in seq1, a seed, the search box bounded by the distance cutoff and gap cutoff, and the range of search]
Time O(n log n), where n is number of seeds.
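The chaining step can be sketched as follows. To stay short, this version scans all earlier seeds instead of doing the diagonal-indexed range query, so it runs in quadratic time rather than O(n log n). The function name, the seed representation as (pos1, pos2) pairs of fixed length k, and the scoring (seed length minus the diagonal shift) are assumptions for illustration.

```python
def chain_seeds(seeds, k, distance_cutoff, gap_cutoff):
    """Return (best_score, chain) for seeds given as (pos1, pos2) pairs.

    A previous seed is eligible if it ends before the current seed in both
    sequences, lies within distance_cutoff in seq1, and its diagonal differs
    by at most gap_cutoff (the "search box").
    """
    seeds = sorted(seeds)
    if not seeds:
        return 0, []
    score = [k] * len(seeds)
    prev = [None] * len(seeds)
    for a, (i, j) in enumerate(seeds):
        for b in range(a):  # quadratic scan standing in for the range query
            pi, pj = seeds[b]
            if pi + k > i or pj + k > j:
                continue
            if i - pi > distance_cutoff:
                continue
            gap = abs((i - j) - (pi - pj))  # shift between diagonals
            if gap > gap_cutoff:
                continue
            candidate = score[b] + k - gap
            if candidate > score[a]:
                score[a], prev[a] = candidate, b
    end = max(range(len(seeds)), key=lambda a: score[a])
    best_score = score[end]
    chain = []
    while end is not None:
        chain.append(seeds[end])
        end = prev[end]
    return best_score, chain[::-1]
```

With `k=3`, `distance_cutoff=20`, and `gap_cutoff=5`, the seeds `(0, 0)`, `(5, 5)`, and `(12, 10)` chain together, while a seed at `(100, 100)` is too far away to join.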
CHAOS Scoring
• Initial score = # matching bp - gaps
• Rapid rescoring: extend all seeds to find optimal location for gaps
Global Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
[Figure: axes labeled x, y, z]
LAGAN
1. Find Local Alignments
2. Chain Local Alignments
3. Restricted DP
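One way to picture step 3 is to run the full recurrence only in the regions between consecutive chained anchors. The sketch below assumes the anchors are exact-match blocks given as (start1, start2, length), already chained in increasing order; the function names and scoring values are illustrative, and the restricted area used by LAGAN itself may be shaped differently.

```python
def nw_score(x, y, gap=1, match=1, mismatch=-1):
    """Global alignment score of x and y, keeping one row of the table at a time."""
    prev = [-j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j] - gap, cur[j - 1] - gap, prev[j - 1] + s))
        prev = cur
    return prev[-1]


def anchored_score(x, y, anchors, gap=1, match=1, mismatch=-1):
    """Score an alignment forced through the anchors.

    anchors: list of (start1, start2, length) exact-match blocks, increasing
    and non-overlapping in both sequences (an assumption of this sketch).
    DP is run only on the stretches between anchors.
    """
    total, pi, pj = 0, 0, 0
    for i, j, length in anchors:
        total += nw_score(x[pi:i], y[pj:j], gap, match, mismatch)
        total += length * match  # the anchor itself is all matches
        pi, pj = i + length, j + length
    total += nw_score(x[pi:], y[pj:], gap, match, mismatch)
    return total
```

When the anchors lie on an optimal alignment, the restricted score equals the unrestricted one while touching far fewer cells.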
MLAGAN: 1. Progressive Alignment
Given N sequences, phylogenetic tree
Align pairwise, in order of the tree (LAGAN)
[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]
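The order of the pairwise steps can be read off the tree with a post-order traversal. This sketch assumes the tree is given as nested 2-tuples of species names, and the grouping used in the usage note (Human with Baboon, Mouse with Rat) is an assumed topology for illustration; the function name is hypothetical.

```python
def progressive_order(tree):
    """Return the pairwise merge steps implied by a binary guide tree.

    tree: a species name (str) or a 2-tuple of subtrees.
    Each step is (left_group, right_group), listed children before parents.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)
        left, right = node
        left_group, right_group = visit(left), visit(right)
        steps.append((left_group, right_group))
        return left_group + right_group

    visit(tree)
    return steps
```

For `(("Human", "Baboon"), ("Mouse", "Rat"))` this yields Human with Baboon, then Mouse with Rat, then the two resulting alignments with each other.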
MLAGAN: 2. Multi-anchoring
To anchor the (X/Y) and (Z) alignments:
[Figure: panels labeled XZ, YZ, X/Y, and Z]
Cystic Fibrosis (CFTR), 12 species
• Human sequence length: 1.8 Mb
• Total genomic sequence: 13 Mb
Species: Human, Baboon, Cat, Dog, Cow, Pig, Mouse, Rat, Chimp, Chicken, Fugufish, Zebrafish
CFTR (cont’d)
Method   Species            % Exons Aligned   TIME (sec)   MAX MEMORY (Mb)
LAGAN    Mammals            99.7%             550          90
LAGAN    Chicken & Fishes   96%               862          90
MLAGAN   Mammals            99.8%             4547         670
MLAGAN   Chicken & Fishes   98%
Automatic computational system for comparative analysis of pairs of genomes
http://pipeline.lbl.gov
Alignments (all pair combinations):
Human Genome (Golden Path Assembly)
Mouse assemblies: Arachne, Phusion (2001), MGSC v3 (2002)
Rat assemblies: January 2003, February 2003
----------
D. melanogaster vs. D. pseudoobscura, February 2003
Tandem Local/Global Approach
• Finding a likely mapping for a contig (BLAT)
Progressive Alignment Scheme
[Flowchart with yes/no branches; box labels:]
• Human, Mouse and Rat genomes
• Pairwise M/R mapping
• Aligned M&R fragments; Unaligned M&R sequences
• Map to Human Genome: mapping aligned fragments by union of M&R local BLAT hits on the human genome
• H/M/R MLAGAN alignment
• M/R pairwise alignment
• M/H and R/H pairwise alignment
• Unassigned M&R DNA fragments
Computational Time
PC cluster of 23 dual 2.2 GHz Intel Xeon nodes.
Pair-wise rat/mouse: 4 hours
Pair-wise rat/human and mouse/human: 2 hours
Multiple human/mouse/rat: 9 hours
Total wall time: ~ 15 hours
Distribution of Large Indels
[Figure: histogram of Count (y-axis, 0 to 200) against Indel length (x-axis, 100 to 550)]
Evolution Over a Chromosome
Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Mutation, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication
Local & Global Alignment
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
[Figure: two panels over this sequence pair, labeled Local and Global]
Glocal Alignment Problem
Find least cost transformation of one sequence into another using new operations
• Sequence edits
• Inversions
• Translocations
• Duplications
• Combinations of above
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC
Shuffle-LAGAN
A glocal aligner for long DNA sequences
S-LAGAN
1. Find Local Alignments
2. Build Rough Homology Map
3. Globally Align Consistent Parts
Building the Homology Map
Chain (using Eppstein Galil); each alignment gets a score which is MAX over 4 possible chains.
Penalties are affine (event and distance components).
Penalties:
a) regular
b) translocation
c) inversion
d) inverted translocation
[Figure: local alignments with links labeled a, b, c, d]
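The four link types and the affine penalty can be pictured with a small sketch. Everything concrete here is an assumption made for illustration: the `LocalAlignment` fields, the rule that decides which of the four cases applies (same or opposite strand, in order or out of order in the second sequence), the way distance is measured, and the cost values passed in. It is not the scoring used by Shuffle-LAGAN.

```python
from typing import NamedTuple


class LocalAlignment(NamedTuple):
    start1: int
    end1: int
    start2: int
    end2: int
    strand: str  # "+" or "-"


def link_type(prev, cur):
    """Classify the link from prev to cur (prev comes first in sequence 1).

    Assumed rule: strand agreement separates inverted from non-inverted
    links; order in sequence 2 separates translocated from in-order links.
    """
    same_strand = prev.strand == cur.strand
    in_order = cur.start2 >= prev.end2
    if same_strand and in_order:
        return "regular"
    if same_strand:
        return "translocation"
    if in_order:
        return "inversion"
    return "inverted translocation"


def link_penalty(prev, cur, event_cost, distance_cost):
    """Affine penalty: a fixed event component plus a distance component."""
    distance = abs(cur.start1 - prev.end1) + abs(cur.start2 - prev.end2)
    return event_cost[link_type(prev, cur)] + distance_cost * distance
```

A chaining pass would then give each alignment the best score over the four kinds of link into it.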
S-LAGAN Results (CFTR)
[Figure panels: Local, Glocal]
S-LAGAN Results (CFTR)
[Figure panels: Hum/Mus, Hum/Rat]
S-LAGAN Results (IGF cluster)
S-LAGAN Results (HOX)
• 12 paralogous genes
• Conserved order in mammals
S-LAGAN Results (Chr 20)
• Human Chr 20 v. homologous Mouse Chr 2
• 270 Segments of conserved synteny
• 70 Inversions
S-LAGAN Results (Whole Genome)
           LAGAN     S-LAGAN
Total      37%       38%
Exon       93%       96%
Ups200     78%       81%
CPU Time   350 Hrs   450 Hrs
• Used Berkeley Genome Pipeline
• % Human genome aligned with mouse sequence
• Evaluation criteria from Waterston, et al. (Nature 2002)
Rearrangements in Human v. Mouse
Preliminary conclusions:
• Rearrangements come in all sizes
• Duplications are less well conserved than other rearranged regions
• Simple inversions tend to be most common and most conserved
What is next? (Shuffle)
• Better algorithm and scoring
• Whole genome synteny mapping
• Multiple Glocal Alignment(!?)
Biological Story
• Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development
Align Human, Mouse, Rat & Fugu
Detailed Alignment
hum_a : CAATAGAGGGTCTGGCAGAGGCTC---------------------CTGGC @ 57336/400001
mus_a : CAATAGAGGGGCTGGCAGAGGCTC---------------------CTGGC @ 78565/400001
rat_a : CAATAGAGGGGCTGGCAGAGACTC---------------------CTGGC @ 112663/369938
fug_a : TGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCGTGGGC @ 36013/68174

hum_a : CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 57386/400001
mus_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 78615/400001
rat_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 112713/369938
fug_a : CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC @ 36063/68174
Can we align human & fly???
CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA
Putting it all together
CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA
Acknowledgments
Stanford: Serafim Batzoglou, Arend Sidow, Matt Scott
Gregory Cooper, Chuong (Tom) Do, Sanket Malde, Kerrin Small, Mukund Sundararajan
Berkeley: Inna Dubchak, Alexander Poliakov
Göttingen: Burkhard Morgenstern
Rat Genome Sequencing Consortium
http://lagan.stanford.edu/