Algorithms for Alignment of Genome Sequences

Algorithms for Alignment of Genomic Sequences

Michael Brudno
Department of Computer Science
Stanford University
PGA Workshop 07/16/2004

Conservation Implies Function

Exon

Gene

CNS: Other Conserved

Edit Distance Model (1)

Weighted sum of insertions, deletions & mutations to transform one string into another

AGGCACA--CA          AGGCACACA
|  ||||  ||    or    |  ||  ||
A--CACATTCA          ACACATTCA
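A minimal sketch of how the weighted sum could be computed for the two example alignments on this slide. The function name and the weights (mismatch = 1, gap = 1) are assumptions for illustration; the slide does not fix any values.

```python
def alignment_cost(row_a, row_b, mismatch=1, gap=1):
    """Weighted edit cost of a gapped alignment given as two equal-length rows.

    The weights are illustrative assumptions, not values from the talk.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    cost = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            cost += gap        # insertion or deletion
        elif a != b:
            cost += mismatch   # mutation (substitution)
    return cost


gapped = ("AGGCACA--CA", "A--CACATTCA")   # 4 gap columns, no mismatches
ungapped = ("AGGCACACA", "ACACATTCA")     # 4 mismatches, no gaps

print(alignment_cost(*gapped), alignment_cost(*ungapped))                # 4 4
print(alignment_cost(*gapped, gap=2), alignment_cost(*ungapped, gap=2))  # 8 4
```

With equal weights the two alignments tie; making gaps more expensive favors the ungapped one, which is why the choice of weights matters.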

Edit Distance Model (2)

Given: x, y

Define: F(i,j) = Score of best alignment of x1…xi to y1…yj

Recurrence: F(i,j) = max ( F(i-1,j) - GAP_PENALTY,
                           F(i,j-1) - GAP_PENALTY,
                           F(i-1,j-1) + SCORE(xi, yj) )

Edit Distance Model (3)

F(i,j) = Score of best alignment ending at i,j

Time O(n^2) for two seqs, O(n^k) for k seqs
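A short sketch of the recurrence as a dynamic program for two sequences. The scoring values (match = 1, mismatch = -1, GAP_PENALTY = 2) and the function name are assumptions chosen for illustration; only the final score is returned, with no traceback.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap_penalty=2):
    """Fill F(i,j) for all prefixes and return F(len(x), len(y))."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix against the empty string costs one gap per letter.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap_penalty
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j] - gap_penalty,    # x_i aligned to a gap
                F[i][j - 1] - gap_penalty,    # y_j aligned to a gap
                F[i - 1][j - 1] + score,      # x_i aligned to y_j
            )
    return F[n][m]


print(global_alignment_score("ACGT", "AGT"))  # 1: three matches, one gap
```

The two nested loops visit every cell once, which is where the O(n^2) bound for two sequences comes from.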

[Figure: DP matrix cell F(i,j) computed from its neighbors F(i-1,j-1), F(i-1,j), and F(i,j-1)]

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story

Local Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

F(i,j) = max (F(i,j), 0)

Return all paths with a position i,j where F(i,j) > C

Time O(n^2) for two seqs, O(n^k) for k seqs
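A sketch of the local variant: the same recurrence with scores clamped at zero, reporting every cell whose score exceeds the cutoff C. Scoring values and names are assumptions; the traceback that would recover each path is omitted, so only end positions and scores are returned.

```python
def local_alignment_hits(x, y, threshold, match=1, mismatch=-1, gap_penalty=2):
    """Return (i, j, F(i,j)) for every cell with F(i,j) > threshold."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    hits = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                            # start a new local alignment
                F[i - 1][j] - gap_penalty,
                F[i][j - 1] - gap_penalty,
                F[i - 1][j - 1] + score,
            )
            if F[i][j] > threshold:
                hits.append((i, j, F[i][j]))
    return hits


# The shared word ACGT ends at position 6 in both sequences.
print(local_alignment_hits("AAACGTCC", "TTACGTGG", threshold=3))  # [(6, 6, 4)]
```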

Heuristic Local Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

BLAST FASTA

CHAOS: CHAins Of Seeds

1. Find short matching words (seeds)

2. Chain them

3. Rescore chain
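Step 1 could be sketched with a plain k-mer index, as below. This is a simplification for illustration, not the CHAOS implementation: only exact matches of a fixed word length k are found, and the function name is hypothetical.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, k):
    """Return (i, j) pairs where seq1[i:i+k] == seq2[j:j+k], ordered by i."""
    index = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)      # every k-word of seq2 and where it starts
    seeds = []
    for i in range(len(seq1) - k + 1):      # scan seq1 left to right
        for j in index.get(seq1[i:i + k], ()):
            seeds.append((i, j))
    return seeds


print(find_seeds("ACGTAC", "TACGT", 3))  # [(0, 1), (1, 2), (3, 0)]
```

Scanning seq1 left to right yields the seeds in order of their location in seq1, which is the order the chaining step consumes them in.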

CHAOS: Chaining the Seeds

• Find seeds at current location in seq1
• Find the previous seeds that fall into the search box
• Do a range query: seeds are indexed by their diagonal
• Pick a previous seed that maximizes the score of chain

[Figure: seq1 against seq2, showing the current location in seq1, a seed, the search box bounded by the distance cutoff and the gap cutoff, and the range of search]

Time O(n log n), where n is number of seeds.
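The chaining steps might look like the sketch below. Several details are assumptions rather than facts from the talk: seeds are exact matches of length k, a chain gains k per seed and loses the diagonal shift between consecutive seeds, and consecutive seeds may not overlap. A sorted Python list stands in for the diagonal index, so insertion and deletion are linear here; a balanced search structure would be needed to reach the O(n log n) bound.

```python
import bisect


def chain_seeds(seeds, k, distance_cutoff, gap_cutoff):
    """Return (best_score, best_chain) for seeds given as (i, j) start positions."""
    seeds = sorted(seeds)                   # order of location in seq1
    if not seeds:
        return 0, []
    score = [0] * len(seeds)
    prev = [None] * len(seeds)
    active = []                             # (diagonal, seed index), kept sorted
    oldest = 0                              # oldest seed still inside the search box
    for c, (i, j) in enumerate(seeds):
        # Drop seeds that lie further back in seq1 than the distance cutoff.
        while oldest < c and seeds[oldest][0] < i - distance_cutoff:
            key = (seeds[oldest][0] - seeds[oldest][1], oldest)
            active.pop(bisect.bisect_left(active, key))
            oldest += 1
        diag = i - j
        # Range query: seeds whose diagonal is within gap_cutoff of this one.
        lo = bisect.bisect_left(active, (diag - gap_cutoff, -1))
        hi = bisect.bisect_right(active, (diag + gap_cutoff, len(seeds)))
        best = 0
        for p_diag, p in active[lo:hi]:
            pi, pj = seeds[p]
            if pi + k <= i and pj + k <= j:  # predecessor ends before this seed starts
                candidate = score[p] - abs(diag - p_diag)
                if candidate > best:
                    best, prev[c] = candidate, p
        score[c] = k + best
        bisect.insort(active, (diag, c))
    # Walk back from the best-scoring seed to recover its chain.
    end = max(range(len(seeds)), key=score.__getitem__)
    chain = []
    while end is not None:
        chain.append(seeds[end])
        end = prev[end]
    return score[max(range(len(seeds)), key=score.__getitem__)], chain[::-1]


print(chain_seeds([(0, 0), (4, 4), (8, 9), (3, 20)], k=4, distance_cutoff=10, gap_cutoff=2))
# (11, [(0, 0), (4, 4), (8, 9)])
```

The seed at (3, 20) sits on a far-away diagonal, so the range query never returns it as a predecessor and it stays out of the chain.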

CHAOS Scoring

• Initial score = # matching bp - gaps

• Rapid rescoring: extend all seeds to find optimal location for gaps

Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

[Figure: axes labeled x, y, z]

LAGAN

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP
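One simple way to picture step 3: once the chained local alignments give anchor points, dynamic programming only has to be run on the pieces between consecutive anchors. The sketch below is an assumed simplification, not LAGAN's actual restricted region: it forces the alignment through each anchor exactly, and the scoring values and names are illustrative.

```python
def nw_score(x, y, match=1, mismatch=-1, gap_penalty=2):
    """Global alignment score with linear gaps, keeping one DP row at a time."""
    prev = [-gap_penalty * j for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [prev[0] - gap_penalty]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j] - gap_penalty, cur[j - 1] - gap_penalty, prev[j - 1] + s))
        prev = cur
    return prev[-1]


def anchored_score(x, y, anchors, **scoring):
    """Score of the best alignment forced through each anchor (i, j).

    An anchor (i, j) means x[:i] is aligned to y[:j] and x[i:] to y[j:].
    Anchors must be increasing in both coordinates.
    """
    points = [(0, 0)] + list(anchors) + [(len(x), len(y))]
    total = 0
    for (i0, j0), (i1, j1) in zip(points, points[1:]):
        if i1 < i0 or j1 < j0:
            raise ValueError("anchors must be increasing in both sequences")
        total += nw_score(x[i0:i1], y[j0:j1], **scoring)  # DP on one small box
    return total


x, y = "ACGTACGT", "ACGTTACGT"
print(nw_score(x, y), anchored_score(x, y, [(4, 4)]))  # 6 6
```

Each box costs time proportional to its own area, so good anchors cut the work well below the full n^2 table; a badly placed anchor can only lower the score, never raise it.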

MLAGAN: 1. Progressive Alignment

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]
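The order of pairwise alignments implied by a tree can be sketched as a post-order walk. The tree shape used here, ((Human, Baboon), (Mouse, Rat)), is an assumption based on the four leaves shown; the function name is hypothetical, and each step only names what would be handed to the pairwise aligner.

```python
def progressive_order(tree):
    """List the pairwise alignment steps for a tree given as nested 2-tuples."""
    steps = []

    def visit(node):
        if isinstance(node, str):          # a leaf is a single sequence
            return node
        left, right = node
        a, b = visit(left), visit(right)   # align the children first
        steps.append((a, b))
        return f"({a}/{b})"                # the merged alignment acts as one unit

    visit(tree)
    return steps


for step in progressive_order((("Human", "Baboon"), ("Mouse", "Rat"))):
    print(step)
# ('Human', 'Baboon')
# ('Mouse', 'Rat')
# ('(Human/Baboon)', '(Mouse/Rat)')
```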

MLAGAN: 2. Multi-anchoring

To anchor the (X/Y) and (Z) alignments:

[Figure: panels labeled XZ, YZ, and X/Y against Z]

Cystic Fibrosis (CFTR), 12 species

• Human sequence length: 1.8 Mb
• Total genomic sequence: 13 Mb

Species: Human, Baboon, Cat, Dog, Cow, Pig, Mouse, Rat, Chimp, Chicken, Fugufish, Zebrafish

CFTR (cont’d)

Method   Species            % Exons Aligned   TIME (sec)   MAX MEMORY (Mb)
LAGAN    Mammals            99.7%             550          90
LAGAN    Chicken & Fishes   96%               862          90
MLAGAN   Mammals            99.8%             4547         670
MLAGAN   Chicken & Fishes   98%

Automatic computational system for comparative analysis of pairs of genomes

http://pipeline.lbl.gov

Alignments (all pair combinations):

Human Genome (Golden Path Assembly)
Mouse assemblies: Arachne, Phusion (2001), MGSC v3 (2002)
Rat assemblies: January 2003, February 2003

D. melanogaster vs D. pseudoobscura: February 2003

Tandem Local/Global Approach

• Finding a likely mapping for a contig (BLAT)

Progressive Alignment Scheme

[Flowchart for the Human, Mouse and Rat genomes, with yes/no decision branches. Box labels: Pairwise M/R mapping; Aligned M&R fragments; Unaligned M&R sequences; Map to Human Genome (mapping aligned fragments by union of M&R local BLAT hits on the human genome); H/M/R MLAGAN alignment; M/R pairwise alignment; M/H and R/H pairwise alignment; Unassigned M&R DNA fragments]

Computational Time

PC cluster of 23 dual 2.2 GHz Intel Xeon nodes.

• Pair-wise rat/mouse: 4 hours
• Pair-wise rat/human and mouse/human: 2 hours
• Multiple human/mouse/rat: 9 hours

Total wall time: ~15 hours

Distribution of Large Indels

[Figure: histogram of large indels. x-axis: Indel length (100 to 550); y-axis: Count (0 to 200)]

Evolution Over a Chromosome

Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

SEQUENCE EDITS: Mutation, Deletion

REARRANGEMENTS: Inversion, Translocation, Duplication

Local & Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Local Global

Glocal Alignment Problem

Find least cost transformation of one sequence into another using new operations:

• Sequence edits
• Inversions
• Translocations
• Duplications
• Combinations of above

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Shuffle-LAGAN

A glocal aligner for long DNA sequences

S-LAGAN

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts

Building the Homology Map

Chain (using Eppstein-Galil); each alignment gets a score which is MAX over 4 possible chains. Penalties are affine (event and distance components).

Penalties:
a) regular
b) translocation
c) inversion
d) inverted translocation

[Figure: chains labeled a, b, c, d]

S-LAGAN Results (CFTR)

[Figure panels: Local, Glocal]

S-LAGAN Results (CFTR)

[Figure panels: Hum/Mus, Hum/Rat]

S-LAGAN Results (IGF cluster)

S-LAGAN Results (HOX)

• 12 paralogous genes
• Conserved order in mammals

S-LAGAN Results (Chr 20)

• Human Chr 20 v. homologous Mouse Chr 2
• 270 Segments of conserved synteny
• 70 Inversions

S-LAGAN Results (Whole Genome)

            LAGAN      S-LAGAN
Total       37%        38%
Exon        93%        96%
Ups200      78%        81%
CPU Time    350 Hrs    450 Hrs

• Used Berkeley Genome Pipeline
• % Human genome aligned with mouse sequence
• Evaluation criteria from Waterston, et al (Nature 2002)

Rearrangements in Human v. Mouse

Preliminary conclusions:

• Rearrangements come in all sizes
• Duplications are less well conserved than other rearranged regions
• Simple inversions tend to be most common and most conserved

What is next? (Shuffle)

• Better algorithm and scoring

• Whole genome synteny mapping

• Multiple Glocal Alignment(!?)

Biological Story

• Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development

Align Human, Mouse, Rat & Fugu

Detailed Alignment

hum_a : CAATAGAGGGTCTGGCAGAGGCTC---------------------CTGGC @ 57336/400001
mus_a : CAATAGAGGGGCTGGCAGAGGCTC---------------------CTGGC @ 78565/400001
rat_a : CAATAGAGGGGCTGGCAGAGACTC---------------------CTGGC @ 112663/369938
fug_a : TGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCGTGGGC @ 36013/68174

hum_a : CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 57386/400001
mus_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 78615/400001
rat_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 112713/369938
fug_a : CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC @ 36063/68174

Can we align human & fly???

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG

Melan  GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Putting it all together

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG

Melan  GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Acknowledgments

Stanford: Serafim Batzoglou, Arend Sidow, Matt Scott, Gregory Cooper, Chuong (Tom) Do, Sanket Malde, Kerrin Small, Mukund Sundararajan

Berkeley: Inna Dubchak, Alexander Poliakov

Göttingen: Burkhard Morgenstern

Rat Genome Sequencing Consortium

http://lagan.stanford.edu/