
Algorithms for Alignment of Genome Sequences

Algorithms for Alignment of Genomic Sequences Michael Brudno Department of Computer Science Stanford University PGA Workshop 07/16/2004
Page 1: Algorithms for Alignment of Genome Sequences

Algorithms for Alignment of Genomic Sequences

Michael Brudno
Department of Computer Science
Stanford University

PGA Workshop 07/16/2004

Page 2: Algorithms for Alignment of Genome Sequences

Conservation Implies Function

Exon

Gene

CNS: Other Conserved

Page 3: Algorithms for Alignment of Genome Sequences

Edit Distance Model (1)

Weighted sum of insertions, deletions & mutations to transform one string into another

AGGCACA--CA        AGGCACACA
|  ||||  ||   or   |  ||  ||
A--CACATTCA        ACACATTCA

Page 4: Algorithms for Alignment of Genome Sequences

Edit Distance Model (2)

Given: x, y

Define: F(i,j) = Score of best alignment of x1…xi to y1…yj

Recurrence: F(i,j) = max ( F(i-1,j) - GAP_PENALTY,
                           F(i,j-1) - GAP_PENALTY,
                           F(i-1,j-1) + SCORE(xi, yj) )
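
The recurrence can be sketched as a short dynamic program. This is a minimal illustration, not code from the talk: the match/mismatch values standing in for SCORE(xi, yj), the gap penalty of 2, and the boundary conditions F(i,0) = -i * GAP_PENALTY and F(0,j) = -j * GAP_PENALTY are all assumptions chosen for the example.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap_penalty=2):
    """Fill the table F from the recurrence and return F(len(x), len(y)).

    match/mismatch play the role of SCORE(xi, yj); their values and the
    gap_penalty value are illustrative assumptions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: aligning a prefix against the empty string costs only gaps.
    for i in range(1, n + 1):
        F[i][0] = -i * gap_penalty
    for j in range(1, m + 1):
        F[0][j] = -j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j] - gap_penalty,    # gap in y
                F[i][j - 1] - gap_penalty,    # gap in x
                F[i - 1][j - 1] + score,      # align xi with yj
            )
    return F[n][m]


if __name__ == "__main__":
    print(global_alignment_score("AGGCACACA", "ACACATTCA"))
```

The two nested loops visit every cell once, which is where the quadratic running time for two sequences comes from.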

Page 5: Algorithms for Alignment of Genome Sequences

Edit Distance Model (3)

F(i,j) = Score of best alignment ending at i,j

Time O(n^2) for two seqs, O(n^k) for k seqs

[Figure: DP matrix cell F(i,j) and its neighbors F(i-1,j), F(i,j-1), F(i-1,j-1)]

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

Page 6: Algorithms for Alignment of Genome Sequences

Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story

Page 7: Algorithms for Alignment of Genome Sequences

Local Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

F(i,j) = max (F(i,j), 0)

Return all paths with a position i,j where F(i,j) > C

Time O(n^2) for two seqs, O(n^k) for k seqs
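
The change from the global model can be sketched as follows. This is a hedged illustration only: the scoring values are assumptions, the parameter `threshold` stands for the cutoff C, and the function reports the qualifying end positions rather than tracing back full paths.

```python
def local_alignment_hits(x, y, threshold, match=1, mismatch=-1, gap_penalty=2):
    """Return (i, j, score) for every cell with F(i,j) > threshold.

    Same recurrence as the global model, except that every cell is
    clamped at 0, so an alignment may start anywhere. Scoring values
    are illustrative assumptions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    hits = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                            # restart the alignment here
                F[i - 1][j] - gap_penalty,
                F[i][j - 1] - gap_penalty,
                F[i - 1][j - 1] + score,
            )
            if F[i][j] > threshold:
                hits.append((i, j, F[i][j]))
    return hits


if __name__ == "__main__":
    # The shared word "ACG" ends at position 5 in both sequences.
    print(local_alignment_hits("TTACGTT", "GGACGGG", threshold=2))
```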

Page 8: Algorithms for Alignment of Genome Sequences

Heuristic Local Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

BLAST FASTA

Page 9: Algorithms for Alignment of Genome Sequences

CHAOS: CHAins Of Seeds

1. Find short matching words (seeds)

2. Chain them

3. Rescore chain
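
Step 1 can be sketched with a simple word index. Note the assumptions: this version finds exact k-letter matches with a hash table, and both the function name `find_seeds` and the default `k=4` are hypothetical choices for illustration, not CHAOS's actual data structure or parameters.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, k=4):
    """Return (i, j) pairs where seq1[i:i+k] == seq2[j:j+k].

    Exact k-mer matching through a hash index; an illustrative
    stand-in for seed finding, not the CHAOS implementation.
    """
    index = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        index[seq2[j:j + k]].append(j)
    seeds = []
    for i in range(len(seq1) - k + 1):
        for j in index.get(seq1[i:i + k], ()):
            seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACGTAC", "TACGTA", k=4))
```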

Page 10: Algorithms for Alignment of Genome Sequences

CHAOS: Chaining the Seeds

• Find seeds at current location in seq1

[Figure: seq1 vs. seq2 plot; labels: location in seq1, seed]

Page 11: Algorithms for Alignment of Genome Sequences

CHAOS: Chaining the Seeds

• Find seeds at current location in seq1

[Figure: seq1 vs. seq2 plot; labels: location in seq1, distance cutoff, seed]

Page 12: Algorithms for Alignment of Genome Sequences

CHAOS: Chaining the Seeds

• Find seeds at current location in seq1

[Figure: seq1 vs. seq2 plot; labels: location in seq1, distance cutoff, gap cutoff, seed]

Page 13: Algorithms for Alignment of Genome Sequences

CHAOS: Chaining the Seeds

• Find seeds at current location in seq1
• Find the previous seeds that fall into the search box

[Figure: seq1 vs. seq2 plot; labels: location in seq1, distance cutoff, gap cutoff, seed, search box]

Page 14: Algorithms for Alignment of Genome Sequences

CHAOS: Chaining the Seeds

• Find seeds at current location in seq1
• Find the previous seeds that fall into the search box
• Do a range query: seeds are indexed by their diagonal

[Figure: seq1 vs. seq2 plot; labels: location in seq1, distance cutoff, gap cutoff, seed, search box, range of search]

Page 15: Algorithms for Alignment of Genome Sequences

CHAOS: Chaining the Seeds

• Find seeds at current location in seq1
• Find the previous seeds that fall into the search box
• Do a range query: seeds are indexed by their diagonal
• Pick a previous seed that maximizes the score of the chain

[Figure: seq1 vs. seq2 plot; labels: location in seq1, distance cutoff, gap cutoff, seed, search box, range of search]

Page 16: Algorithms for Alignment of Genome Sequences

CHAOS: Chaining the Seeds

• Find seeds at current location in seq1
• Find the previous seeds that fall into the search box
• Do a range query: seeds are indexed by their diagonal
• Pick a previous seed that maximizes the score of the chain

[Figure: seq1 vs. seq2 plot; labels: location in seq1, distance cutoff, gap cutoff, seed, search box, range of search]

Time O(n log n), where n is number of seeds.
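
The chaining loop can be sketched as below. Several things here are assumptions made for illustration: seeds are (i, j) start positions of exact k-letter matches, the chain score is the number of matching bases minus the diagonal shift (following the "# matching bp - gaps" initial score), and the diagonal index is a plain sorted Python list queried with `bisect`. A sorted list with insertion and no eviction of old seeds does not reach the O(n log n) bound quoted above; a balanced tree that drops seeds beyond the distance cutoff would be needed for that.

```python
import bisect


def chain_seeds(seeds, k, distance_cutoff, gap_cutoff):
    """Chain seeds (i, j) of length k; return (best_score, chain).

    A previous seed is a candidate if its diagonal is within gap_cutoff
    of the current diagonal (range query) and it ends no more than
    distance_cutoff before the current seed starts (search box).
    """
    if not seeds:
        return 0, []
    seeds = sorted(seeds)
    by_diag = []                      # sorted (diagonal, seed index) pairs
    score = [k] * len(seeds)
    prev = [None] * len(seeds)
    for idx, (i, j) in enumerate(seeds):
        d = i - j
        lo = bisect.bisect_left(by_diag, (d - gap_cutoff, -1))
        hi = bisect.bisect_right(by_diag, (d + gap_cutoff, len(seeds)))
        for pd, p in by_diag[lo:hi]:
            pi, pj = seeds[p]
            inside_box = (pi + k <= i and pj + k <= j
                          and i - (pi + k) <= distance_cutoff)
            if inside_box:
                candidate = score[p] + k - abs(d - pd)
                if candidate > score[idx]:
                    score[idx], prev[idx] = candidate, p
        bisect.insort(by_diag, (d, idx))
    best = max(range(len(seeds)), key=score.__getitem__)
    best_score = score[best]
    chain = []
    while best is not None:
        chain.append(seeds[best])
        best = prev[best]
    return best_score, chain[::-1]


if __name__ == "__main__":
    demo = [(0, 0), (4, 4), (10, 11), (50, 3)]
    print(chain_seeds(demo, k=4, distance_cutoff=10, gap_cutoff=2))
```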

Page 17: Algorithms for Alignment of Genome Sequences

CHAOS Scoring

• Initial score = # matching bp - gaps

• Rapid rescoring: extend all seeds to find optimal location for gaps

Page 18: Algorithms for Alignment of Genome Sequences

Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story

Page 19: Algorithms for Alignment of Genome Sequences

Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

[Figure labels: x, y, z]

Page 20: Algorithms for Alignment of Genome Sequences

LAGAN: 1. FIND Local Alignments

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP

Page 21: Algorithms for Alignment of Genome Sequences

LAGAN: 2. CHAIN Local Alignments

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP

Page 22: Algorithms for Alignment of Genome Sequences

LAGAN: 3. Restricted DP

1. Find Local Alignments

2. Chain Local Alignments

3. Restricted DP
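
Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions: each local alignment is a hypothetical tuple (x_start, x_end, y_start, y_end, score), chaining uses a simple quadratic scan rather than whatever LAGAN uses internally, and the "restricted" region is reduced to the rectangles between consecutive anchors.

```python
def chain_anchors(anchors):
    """Return (score, chain): the best-scoring set of anchors that is
    ordered in both sequences. Quadratic scan, for illustration only."""
    if not anchors:
        return 0, []
    anchors = sorted(anchors)
    best = [a[4] for a in anchors]
    prev = [None] * len(anchors)
    for b in range(len(anchors)):
        for a in range(b):
            # a may precede b only if it ends before b starts in x and y
            if anchors[a][1] <= anchors[b][0] and anchors[a][3] <= anchors[b][2]:
                if best[a] + anchors[b][4] > best[b]:
                    best[b] = best[a] + anchors[b][4]
                    prev[b] = a
    end = max(range(len(anchors)), key=best.__getitem__)
    total = best[end]
    chain = []
    while end is not None:
        chain.append(anchors[end])
        end = prev[end]
    return total, chain[::-1]


def boxes_between(chain, len_x, len_y):
    """Rectangles (x_lo, x_hi, y_lo, y_hi) left for DP between anchors."""
    boxes, px, py = [], 0, 0
    for xs, xe, ys, ye, _ in chain:
        boxes.append((px, xs, py, ys))
        px, py = xe, ye
    boxes.append((px, len_x, py, len_y))
    return boxes


if __name__ == "__main__":
    demo = [(0, 10, 0, 10, 5), (20, 30, 5, 8, 4),
            (40, 50, 40, 50, 6), (15, 25, 15, 25, 3)]
    score, chain = chain_anchors(demo)
    print(score, chain)
    print(boxes_between(chain, 60, 60))
```

Running the quadratic recurrence only inside these boxes is what keeps the final step far cheaper than filling the whole matrix.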

Page 23: Algorithms for Alignment of Genome Sequences

MLAGAN: 1. Progressive Alignment

Given N sequences, phylogenetic tree

Align pairwise, in order of the tree (LAGAN)

Human

Baboon

Mouse

Rat
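
The order of pairwise steps implied by a tree can be sketched with a post-order traversal. The tree shape used in the demo, ((Human, Baboon), (Mouse, Rat)), is an assumption, since the slide's figure did not survive extraction; the nested-tuple encoding and the function name are also hypothetical.

```python
def alignment_order(tree, steps=None):
    """Return (label, steps) for a binary guide tree.

    tree is a leaf name (str) or a (left, right) tuple. steps lists the
    pairwise alignments in the order a progressive aligner would run
    them: children first, then the alignment joining them.
    """
    if steps is None:
        steps = []
    if isinstance(tree, str):
        return tree, steps
    left, _ = alignment_order(tree[0], steps)
    right, _ = alignment_order(tree[1], steps)
    steps.append((left, right))
    return f"({left}/{right})", steps


if __name__ == "__main__":
    guide_tree = (("Human", "Baboon"), ("Mouse", "Rat"))  # assumed shape
    label, steps = alignment_order(guide_tree)
    for a, b in steps:
        print("align", a, "with", b)
```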

Page 24: Algorithms for Alignment of Genome Sequences

MLAGAN: 2. Multi-anchoring

To anchor the (X/Y) and (Z) alignments:

[Figure labels: XZ, YZ, X/Y, Z]

Page 25: Algorithms for Alignment of Genome Sequences

Cystic Fibrosis (CFTR), 12 species

• Human sequence length: 1.8 Mb
• Total genomic sequence: 13 Mb

Species: Human, Baboon, Cat, Dog, Cow, Pig, Mouse, Rat, Chimp, Chicken, Fugufish, Zebrafish

Page 26: Algorithms for Alignment of Genome Sequences

CFTR (cont’d )

          % Exons Aligned           TIME (sec)   MAX MEMORY (Mb)
LAGAN     Mammals: 99.7%            550          90
LAGAN     Chicken & Fishes: 96%     862          90
MLAGAN    Mammals: 99.8%            4547         670
          Chicken & Fishes: 98%

Page 27: Algorithms for Alignment of Genome Sequences

Automatic computational system for comparative analysis of pairs of genomes
http://pipeline.lbl.gov

Alignments (all pair combinations):

Human Genome (Golden Path Assembly)
Mouse assemblies: Arachne, Phusion (2001), MGSC v3 (2002)
Rat assemblies: January 2003, February 2003
----------------------------------------------------------
D. melanogaster vs D. pseudoobscura, February 2003

Page 28: Algorithms for Alignment of Genome Sequences

Tandem Local/Global Approach

• Finding a likely mapping for a contig (BLAT)

Page 29: Algorithms for Alignment of Genome Sequences

Progressive Alignment Scheme

[Flowchart with yes/no branches; box labels:]
• Human, Mouse and Rat genomes
• Pairwise M/R mapping
• Aligned M&R fragments / Unaligned M&R sequences
• Map to Human Genome: mapping aligned fragments by union of M&R local BLAT hits on the human genome
• H/M/R MLAGAN alignment
• M/R pairwise alignment
• M/H and R/H pairwise alignment
• Unassigned M&R DNA fragments

Page 30: Algorithms for Alignment of Genome Sequences

Computational Time

23 dual 2.2GHz Intel Xeon node PC cluster.

Pair-wise rat/mouse: 4 hours
Pair-wise rat/human and mouse/human: 2 hours
Multiple human/mouse/rat: 9 hours

Total wall time: ~ 15 hours

Page 31: Algorithms for Alignment of Genome Sequences

Distribution of Large Indels

[Figure: histogram; x-axis: Indel length (100 to 550), y-axis: Count (0 to 200)]

Page 32: Algorithms for Alignment of Genome Sequences

Evolution Over a Chromosome

Page 33: Algorithms for Alignment of Genome Sequences

Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story

Page 34: Algorithms for Alignment of Genome Sequences

Evolution at the DNA level

…ACGGTGCAGTTACCA…

…AC----CAGTCCACCA…

SEQUENCE EDITS: Mutation, Deletion

REARRANGEMENTS: Inversion, Translocation, Duplication

Page 35: Algorithms for Alignment of Genome Sequences

Local & Global Alignment

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Local Global

Page 36: Algorithms for Alignment of Genome Sequences

Glocal Alignment Problem

Find least cost transformation of one sequence into another using new operations
• Sequence edits
• Inversions
• Translocations
• Duplications
• Combinations of above

AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTC

Page 37: Algorithms for Alignment of Genome Sequences

Shuffle-LAGAN

A glocal aligner for long DNA sequences

Page 38: Algorithms for Alignment of Genome Sequences

S-LAGAN: Find Local Alignments

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts

Page 39: Algorithms for Alignment of Genome Sequences

S-LAGAN: Build Homology Map

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts

Page 40: Algorithms for Alignment of Genome Sequences

Building the Homology Map

Chain (using Eppstein-Galil); each alignment gets a score which is MAX over 4 possible chains. Penalties are affine (event and distance components).

Penalties:
a) regular
b) translocation
c) inversion
d) inverted translocation

[Figure: local alignments labeled a, b, c, d]
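
A rough sketch of chaining with four link types follows. Much of it is assumed for illustration and should not be read as Shuffle-LAGAN's actual rules: the `Hit` record, the rule in `link_type` that decides which of the four cases a link falls into, the penalty values, and the use of the seq1 gap as the distance component are all hypothetical. It also uses a quadratic scan in place of the Eppstein-Galil sparse dynamic programming named above.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "x_start x_end y_start y_end strand score")


def link_type(p, c):
    """Classify the link from hit p to hit c (assumed rule, for illustration)."""
    if p.strand == c.strand:
        if c.strand == "+":
            collinear = c.y_start >= p.y_end
        else:
            collinear = c.y_end <= p.y_start
        return "regular" if collinear else "translocation"
    return "inversion" if c.y_start >= p.y_end else "inverted translocation"


def glocal_chain(hits, event_penalty, distance_weight=0.0):
    """Chain hits in seq1 order only; order and strand in seq2 are free.

    Each link pays an affine penalty: a per-event term chosen by link
    type plus distance_weight times the gap in seq1.
    """
    if not hits:
        return 0, []
    hits = sorted(hits)
    best = [h.score for h in hits]
    prev = [None] * len(hits)
    for c in range(len(hits)):
        for p in range(c):
            if hits[p].x_end > hits[c].x_start:
                continue  # overlapping in seq1, cannot be chained
            kind = link_type(hits[p], hits[c])
            distance = hits[c].x_start - hits[p].x_end
            candidate = (best[p] + hits[c].score
                         - event_penalty[kind] - distance_weight * distance)
            if candidate > best[c]:
                best[c], prev[c] = candidate, p
    end = max(range(len(hits)), key=best.__getitem__)
    total = best[end]
    chain = []
    while end is not None:
        chain.append(hits[end])
        end = prev[end]
    return total, chain[::-1]


if __name__ == "__main__":
    penalties = {"regular": 0, "translocation": 5,
                 "inversion": 5, "inverted translocation": 8}  # assumed values
    demo = [Hit(0, 10, 0, 10, "+", 20),
            Hit(12, 20, 50, 58, "-", 15),
            Hit(22, 30, 12, 20, "+", 18)]
    print(glocal_chain(demo, penalties))
```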

Page 41: Algorithms for Alignment of Genome Sequences

S-LAGAN: Build Homology Map

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts

Page 42: Algorithms for Alignment of Genome Sequences

S-LAGAN: Global Alignment

1. Find Local Alignments

2. Build Rough Homology Map

3. Globally Align Consistent Parts

Page 43: Algorithms for Alignment of Genome Sequences

S-LAGAN Results (CFTR)

[Figure panels: Local, Glocal]

Page 44: Algorithms for Alignment of Genome Sequences

S-LAGAN Results (CFTR)

[Figure panels: Hum/Mus, Hum/Rat]

Page 45: Algorithms for Alignment of Genome Sequences

S-LAGAN Results (IGF cluster)

Page 46: Algorithms for Alignment of Genome Sequences

S-LAGAN Results (HOX)

• 12 paralogous genes
• Conserved order in mammals

Page 47: Algorithms for Alignment of Genome Sequences

S-LAGAN Results (HOX)

• 12 paralogous genes
• Conserved order in mammals

Page 48: Algorithms for Alignment of Genome Sequences

S-LAGAN Results (Chr 20)

• Human Chr 20 v. homologous Mouse Chr 2
• 270 Segments of conserved synteny
• 70 Inversions

Page 49: Algorithms for Alignment of Genome Sequences

S-LAGAN Results (Whole Genome)

            LAGAN      S-LAGAN
Total       37%        38%
Exon        93%        96%
Ups200      78%        81%
CPU Time    350 Hrs    450 Hrs

• Used Berkeley Genome Pipeline
• % Human genome aligned with mouse sequence
• Evaluation criteria from Waterston, et al (Nature 2002)

Page 50: Algorithms for Alignment of Genome Sequences

Rearrangements in Human v. Mouse

Preliminary conclusions:
• Rearrangements come in all sizes
• Duplications are less well conserved than other rearranged regions
• Simple inversions tend to be most common and most conserved

Page 51: Algorithms for Alignment of Genome Sequences

What is next? (Shuffle)

• Better algorithm and scoring

• Whole genome synteny mapping

• Multiple Glocal Alignment(!?)

Page 52: Algorithms for Alignment of Genome Sequences

Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story

Page 53: Algorithms for Alignment of Genome Sequences

Biological Story

• Math1 (Mouse Atonal Homologue 1, also ATOH) is a gene that is responsible for nervous system development

Page 54: Algorithms for Alignment of Genome Sequences

Align Human, Mouse, Rat & Fugu

Page 55: Algorithms for Alignment of Genome Sequences

Detailed Alignment

hum_a : CAATAGAGGGTCTGGCAGAGGCTC---------------------CTGGC @ 57336/400001
mus_a : CAATAGAGGGGCTGGCAGAGGCTC---------------------CTGGC @ 78565/400001
rat_a : CAATAGAGGGGCTGGCAGAGACTC---------------------CTGGC @ 112663/369938
fug_a : TGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCGTGGGC @ 36013/68174

hum_a : CGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 57386/400001
mus_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 78615/400001
rat_a : CCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGC @ 112713/369938
fug_a : CGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCGAGCGC @ 36063/68174

Page 56: Algorithms for Alignment of Genome Sequences

Can we align human & fly???

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
 GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Page 57: Algorithms for Alignment of Genome Sequences

Putting it all together

CGCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
CCCGGTGC-GGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTGAGCGCACTCG-CTTTCAGGCAGCTCCCCGGGGAG
 GAGGTGTTGGATGGCCTGAGTGA-AGCACGCGCTGTCAGCTGGCGAGCGCTCGCG-AGTCCCTGCCGTGTCCCCG
Melan GCTACTCCAGCT-ACCACCTGCATGCAGCTGCACAGC
Pseudo GCCACTGAGACT-GCCACCTGCATGCAGCTGCACAGA

Page 58: Algorithms for Alignment of Genome Sequences

Overview

• Local Alignment (CHAOS)

• Multiple Global Alignment (LAGAN)
  - Whole Genome Alignment

• Glocal Alignment (Shuffle-LAGAN)

• Biological Story

Page 59: Algorithms for Alignment of Genome Sequences

Acknowledgments

Stanford: Serafim Batzoglou, Arend Sidow, Matt Scott, Gregory Cooper, Chuong (Tom) Do, Sanket Malde, Kerrin Small, Mukund Sundararajan

Berkeley: Inna Dubchak, Alexander Poliakov

Göttingen: Burkhard Morgenstern

Rat Genome Sequencing Consortium

http://lagan.stanford.edu/

