
Alignment of large genomic sequences

Date post: 29-Jan-2016
Upload: brigid
Transcript
Page 1: Alignment of large  genomic sequences

Alignment of large genomic sequences

Fragment-based alignment approach (DIALIGN) useful for alignment of genomic sequences.

Possible applications: Detection of regulatory elements Identification of pathogenic microorganisms Gene prediction

Page 2: Alignment of large  genomic sequences

The DIALIGN approach

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 8: Alignment of large  genomic sequences

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 9: Alignment of large  genomic sequences

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacccctgaattgaataa

Page 10: Alignment of large  genomic sequences

The DIALIGN approach

atc------taatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

Page 11: Alignment of large  genomic sequences

The DIALIGN approach

atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

Page 12: Alignment of large  genomic sequences

The DIALIGN approach

atc------taatagttaaactcccccgtgc-ttag

cagtgcgtgtattactaac----------gg-ttcaatcgcg

caaa--gagtatcacc----------cctgaattgaataa

Consistency!

Page 13: Alignment of large  genomic sequences

The DIALIGN approach

atc------TAATAGTTAaactccccCGTGC-TTag

cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg

caaa--GAGTATCAcc----------CCTGaaTTGAATaa

Page 14: Alignment of large  genomic sequences

First step in sequence comparison: alignment

S1 S2 S3

Page 15: Alignment of large  genomic sequences

First step in sequence comparison: alignment

For genomic sequences: neither local nor global methods are appropriate

S1 S2

S1’ S2’ S3’

S3

Page 16: Alignment of large  genomic sequences

First step in sequence comparison: alignment

Local method finds single best local similarity

S1 S2

S1’ S2’ S3’

S3

Page 17: Alignment of large  genomic sequences

First step in sequence comparison: alignment

Multiple application of local methods possible

S1 S2

S1’ S2’ S3’

S3

Page 21: Alignment of large  genomic sequences

First step in sequence comparison: alignment

Threshold has to be applied to filter alignments: reduced sensitivity!

S1 S2

S1’ S2’ S3’

S3

Page 22: Alignment of large  genomic sequences

First step in sequence comparison: alignment

Alternative approach:

During evolution few large-scale re-arrangements

-> relative order of homologies conserved

Search for chain of local homologies

Page 23: Alignment of large  genomic sequences

First step in sequence comparison: alignment

Genomic alignment: chain of homologies

S1 S2

S1’ S2’ S3’

S3

Page 27: Alignment of large  genomic sequences

First step in sequence comparison: alignment

Novel approaches for genomic alignment:

WABA PipMaker MGA TBA Lagan Avid DIALIGN

Page 28: Alignment of large  genomic sequences

Alignment of large genomic sequences

Gene-regulatory sites identified by multiple sequence alignment (phylogenetic footprinting)

Page 29: Alignment of large  genomic sequences

Alignment of large genomic sequences

Page 30: Alignment of large  genomic sequences

Objective function for DIALIGN:

Weight score for every possible fragment f based on P-value: P(f) = probability of finding a fragment “like f” by chance

in random sequences with same length as input sequences

w(f) = -log P(f) (“weight score” of f)

”like f” means: at least same # matches (DNA, RNA) or sum of similarity values (proteins)
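The weight score above can be illustrated for DNA with a small sketch. It assumes independent positions with a match probability of 0.25 and corrects for the number of possible fragment positions by multiplying with the product of the sequence lengths. Both choices, and the function names, are assumptions made for illustration; the exact probability calculation used by DIALIGN is not given on the slides.

```python
import math


def binomial_tail(length: int, matches: int, p: float = 0.25) -> float:
    """Probability of at least `matches` matches among `length` aligned positions."""
    return sum(
        math.comb(length, k) * p**k * (1 - p) ** (length - k)
        for k in range(matches, length + 1)
    )


def fragment_weight(length: int, matches: int, len1: int, len2: int,
                    p: float = 0.25) -> float:
    """w(f) = -log P(f) for a gap-free DNA fragment (illustrative approximation)."""
    # Assumption: one chance per pair of start positions, capped at probability 1.
    prob = min(1.0, len1 * len2 * binomial_tail(length, matches, p))
    return -math.log(prob)
```

Under these assumptions a fragment with few matches gets a weight of zero, because such a fragment is expected to occur by chance somewhere in sequences of that length.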

Page 31: Alignment of large  genomic sequences

Objective function for DIALIGN:

Score of alignment: sum of weight scores of fragments – no gap penalty!

Page 32: Alignment of large  genomic sequences

Optimization problem for DIALIGN:

Find consistent collection of fragments with maximum total weight score!
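For two sequences, a consistent collection of fragments is simply a chain in which every fragment lies strictly after the previous one in both sequences. The sketch below finds the chain with maximum total weight by quadratic dynamic programming. The `Fragment` type and function names are hypothetical, and the sketch covers only the pairwise case, not the multiple-alignment case shown on the slides.

```python
from typing import List, NamedTuple, Tuple


class Fragment(NamedTuple):
    start1: int    # start position in the first sequence
    start2: int    # start position in the second sequence
    length: int    # gap-free length
    weight: float  # weight score w(f)


def precedes(f: Fragment, g: Fragment) -> bool:
    """True if f ends before g starts in both sequences."""
    return f.start1 + f.length <= g.start1 and f.start2 + f.length <= g.start2


def best_chain(fragments: List[Fragment]) -> Tuple[float, List[Fragment]]:
    """Return the maximum total weight and one chain that achieves it."""
    frags = sorted(fragments, key=lambda f: (f.start1, f.start2))
    if not frags:
        return 0.0, []
    best = [f.weight for f in frags]  # best chain weight ending at each fragment
    prev = [-1] * len(frags)          # predecessor index for backtracking
    for j, g in enumerate(frags):
        for i in range(j):
            if precedes(frags[i], g) and best[i] + g.weight > best[j]:
                best[j] = best[i] + g.weight
                prev[j] = i
    j = max(range(len(frags)), key=best.__getitem__)
    total = best[j]
    chain = []
    while j != -1:
        chain.append(frags[j])
        j = prev[j]
    return total, chain[::-1]
```

No gap penalty appears anywhere in the recursion: the score of a chain is only the sum of its fragment weights, as stated above.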

Page 33: Alignment of large  genomic sequences

Alternative fragment weight scores for genomic sequences:

Calculate fragment scores at nucleotide level and at peptide level.

Page 34: Alignment of large  genomic sequences

catcatatcttatcttacgttaactcccccgt

cagtgcgtgatagcccatatccgg

Page 36: Alignment of large  genomic sequences

catcatatcttatcttacgttaactcccccgt

cagtgcgtgatagcccatatccgg

Standard score: consider length and number of matches, compute probability of random occurrence

Page 37: Alignment of large  genomic sequences

Translation option:

catcatatcttatcttacgttaactcccccgt

cagtgcgtgatagcccatatccgg

Page 38: Alignment of large  genomic sequences

Translation option:

L S Y V

catcatatc tta tct tac gtt aactcccccgt

cagtgcgtg ata gcc cat atc cgg

I A H I

DNA segments translated to peptide segments; fragment score based on peptide similarity:

Calculate probability of finding a fragment of the same length with (at least) the same sum of BLOSUM values
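The translation step can be sketched as follows, using the standard genetic code. The function name is an assumption, and the BLOSUM-based scoring of the resulting peptides is deliberately left out.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA segment codon by codon; trailing partial codons are ignored."""
    dna = dna.lower()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))
```

Applied to the two segments on the slide, `translate("ttatcttacgtt")` gives `LSYV` and `translate("atagcccatatc")` gives `IAHI`. Characters other than a, c, g and t raise a `KeyError` in this sketch.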

Page 39: Alignment of large  genomic sequences

P-fragment (in both orientations)

L S Y V

catcatatc tta tct tac gtt aactcccccgt

cagtgcgtg ata gcc cat atc cgg

I A H I

N-fragment

catcatatc ttatcttacgtt aactcccccgtgct
           || |  |  |
cagtgcgtg atagcccatatc cg

For each fragment f three probability values calculated; Score of f based on smallest P value.

Page 40: Alignment of large  genomic sequences

Alternative fragment weight scores for genomic sequences:

Calculate fragment scores at nucleotide level and at peptide level.

Page 41: Alignment of large  genomic sequences

DIALIGN alignment of human and murine genomic sequences

Page 42: Alignment of large  genomic sequences

DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences

Page 43: Alignment of large  genomic sequences

Alignment of large genomic sequences

Evaluation of signal detection methods: Apply method to data with known signals (correct answer is known!). E.g. experimentally verified genes for gene finding

TP = true positives = # signals correctly predicted (i.e. signal present)

FP = false positives = # signals predicted but wrong (i.e. no signal present)

TN = true negative = # no signal predicted, no signal present

FN = false negative = # no signal predicted, signal present!

Page 44: Alignment of large  genomic sequences

Alignment of large genomic sequences

Sn = Sensitivity

= correctly predicted signals / present signals

= TP / (TP + FN)

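Sp = Specificity (as defined in gene finding; elsewhere this quantity is called precision or positive predictive value)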

= correctly predicted signals / predicted signals = TP / (TP + FP)
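The two measures translate directly into code. This is a minimal sketch; returning 0.0 when the denominator is zero is a convention chosen here, not something stated on the slides.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sn = TP / (TP + FN): fraction of present signals that were predicted."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp: int, fp: int) -> float:
    """Sp = TP / (TP + FP): fraction of predicted signals that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0
```

For example, 8 correct predictions with 2 missed signals and 8 wrong predictions give Sn = 0.8 and Sp = 0.5.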

Page 45: Alignment of large  genomic sequences

Alignment of large genomic sequences

Comprehensive evaluation of signal prediction method:

Method assigns score to predictions

Apply threshold parameter

High threshold -> high specificity (Sp), low sensitivity (Sn)

Low threshold -> high sensitivity (Sn), low specificity (Sp)

ROC curve ("receiver operating characteristic" curve)

Vary threshold parameter, plot Sn against Sp
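The threshold sweep can be sketched as below. The input format (a list of score and correctness pairs), the function name, and the convention Sp = 1.0 when nothing is predicted are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(
    scored_predictions: Sequence[Tuple[float, bool]],
    n_true_signals: int,
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, Sn, Sp) for each threshold.

    scored_predictions holds (score, is_correct) pairs; n_true_signals is the
    number of signals actually present in the test data.
    """
    rows = []
    for t in thresholds:
        kept = [ok for score, ok in scored_predictions if score >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        sn = tp / n_true_signals if n_true_signals else 0.0
        # Assumption: with no predictions at all, Sp is reported as 1.0.
        sp = tp / (tp + fp) if kept else 1.0
        rows.append((t, sn, sp))
    return rows
```

Plotting the Sn and Sp columns against each other for a fine grid of thresholds gives the curve described above.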

Page 46: Alignment of large  genomic sequences

Performance of long-range alignment programs for exon discovery (human - mouse comparison)

Page 47: Alignment of large  genomic sequences

DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences

Page 48: Alignment of large  genomic sequences

AGenDA:

Alignment-based Gene Detection Algorithm

Bridge small gaps between DIALIGN fragments

-> cluster of fragments

Page 49: Alignment of large  genomic sequences

AGenDA:

Alignment-based Gene Detection Algorithm

Bridge small gaps between DIALIGN fragments

-> cluster of fragments

Search for conserved splice sites and start/stop codons at cluster boundaries to identify candidate exons

Page 50: Alignment of large  genomic sequences

AGenDA:

Alignment-based Gene Detection Algorithm

Bridge small gaps between DIALIGN fragments

-> cluster of fragments

Search for conserved splice sites and start/stop codons at cluster boundaries to identify candidate exons

Recursive algorithm finds biologically consistent chain of potential exons
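The first AGenDA step, bridging small gaps between fragments, can be sketched as interval merging on one sequence. The `max_gap` parameter and the function name are hypothetical; the slides do not say which gap size AGenDA actually bridges.

```python
from typing import List, Sequence, Tuple


def cluster_fragments(
    intervals: Sequence[Tuple[int, int]], max_gap: int
) -> List[Tuple[int, int]]:
    """Merge fragments (inclusive start, end) separated by at most max_gap positions."""
    clusters: List[List[int]] = []
    for start, end in sorted(intervals):
        # Number of unaligned positions between the current cluster and this fragment.
        if clusters and start - clusters[-1][1] - 1 <= max_gap:
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            clusters.append([start, end])
    return [(s, e) for s, e in clusters]
```

With `max_gap=5`, the fragments (1, 10), (13, 20) and (40, 50) form two clusters, (1, 20) and (40, 50). The search for splice sites and start/stop codons would then look at the boundaries of these clusters.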

Page 51: Alignment of large  genomic sequences

Identification of candidate exons

Fragments in DIALIGN alignment

Page 52: Alignment of large  genomic sequences

Identification of candidate exons

Build cluster of fragments

Page 53: Alignment of large  genomic sequences

Identification of candidate exons

Identify conserved splice sites

Page 54: Alignment of large  genomic sequences

Identification of candidate exons

Candidate exons bounded by conserved splice sites

Page 55: Alignment of large  genomic sequences

Construct gene models using candidate exons

Score of candidate exon (E) based on DIALIGN scores for fragments, score of splice junctions and penalty for shortening / extending

Find biologically consistent chain of candidate exons (starting with start codon, ending with stop codon, no internal stop codons …) with maximal total score

$$ sc(E) = \frac{len(C) - dis(C,E)}{len(C)} \cdot \sum_i w(f_i) + sc(SP) $$

Page 56: Alignment of large  genomic sequences

Find optimal consistent chain of candidate exons

Page 60: Alignment of large  genomic sequences

Find optimal consistent chain of candidate exons

atg gt ag gt ag tga atg tga

Page 61: Alignment of large  genomic sequences

Find optimal consistent chain of candidate exons

atg gt ag gt ag tga atg tga

G1 G2

Page 62: Alignment of large  genomic sequences

Find optimal consistent chain of candidate exons

Recursive algorithm calculates optimal chain of candidate exons in O(N log N) time
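The slides do not spell out the recursion. As an illustration of how an O(N log N) bound can be reached, the sketch below uses standard weighted interval scheduling (sort by end position, then binary search for the last compatible predecessor). It is a stand-in technique, not AGenDA's algorithm: it ignores start and stop codons, reading frames and all other biological consistency rules, and the tuple layout is an assumption.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

Exon = Tuple[int, int, float]  # (start, end, score), inclusive coordinates


def best_exon_chain(exons: Sequence[Exon]) -> Tuple[float, List[Exon]]:
    """Maximum-score chain of non-overlapping candidate exons."""
    ordered = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in ordered]
    best = [0.0] * (len(ordered) + 1)  # best[k]: optimum over the first k exons
    take = [False] * len(ordered)
    pred = [0] * len(ordered)
    for k, (start, _end, score) in enumerate(ordered):
        p = bisect_left(ends, start)   # number of exons ending before `start`
        pred[k] = p
        if score + best[p] > best[k]:
            best[k + 1] = score + best[p]
            take[k] = True
        else:
            best[k + 1] = best[k]
    # Backtrack from the full problem to recover one optimal chain.
    chain: List[Exon] = []
    k = len(ordered)
    while k > 0:
        if take[k - 1]:
            chain.append(ordered[k - 1])
            k = pred[k - 1]
        else:
            k -= 1
    return best[-1], chain[::-1]
```

Sorting costs O(N log N) and each of the N binary searches costs O(log N), which gives the overall bound.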

Page 65: Alignment of large  genomic sequences

DIALIGN fragments

Page 66: Alignment of large  genomic sequences

Candidate exons

Page 67: Alignment of large  genomic sequences

Gene model

Page 68: Alignment of large  genomic sequences

Result: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

Page 69: Alignment of large  genomic sequences

Result: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

[Bar chart: sensitivity and specificity (0% to 100%) for AGenDA and GenScan]

Page 70: Alignment of large  genomic sequences

[Figure: comparison of AGenDA and GenScan predictions; values shown: 64 %, 12 %, 17 %]

Result: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

Page 71: Alignment of large  genomic sequences

Alignment of large genomic sequences

DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)

Page 72: Alignment of large  genomic sequences

Alignment of large genomic sequences

DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)

Alignment of Hox gene cluster:

Page 73: Alignment of large  genomic sequences

Alignment of large genomic sequences

DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)

Alignment of Hox gene cluster:

DIALIGN able to identify small regulatory elements, but

Page 74: Alignment of large  genomic sequences

Alignment of large genomic sequences

DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)

Alignment of Hox gene cluster:

DIALIGN able to identify small regulatory elements, but

Entire genes totally mis-aligned

Page 75: Alignment of large  genomic sequences

Alignment of large genomic sequences

DIALIGN used by tracker for phylogenetic footprinting (Prohaska et al., 2004)

Alignment of Hox gene cluster:

DIALIGN able to identify small regulatory elements, but

Entire genes totally mis-aligned

Reason for mis-alignment: duplications!

Page 76: Alignment of large  genomic sequences

Alignment of large genomic sequences

The Hox gene cluster:

4 Hox gene clusters in pufferfish. 14 genes, different genes in different clusters!

Page 77: Alignment of large  genomic sequences

Alignment of large genomic sequences

The Hox gene cluster:

Complete mis-alignment of entire genes!

Page 78: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Page 79: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Conserved motifs; no similarity outside motifs

Page 80: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Duplication in two sequences

Page 83: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Mis-alignment would have lower score!

Page 84: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Duplication in one sequence

Page 86: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Duplication in one sequence

Possible mis-alignment

Page 87: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Duplication in one sequence

S3

Page 91: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Consistency problem

S3

Page 92: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

More plausible alignment – and higher score:

S3

Page 93: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Consistency problem

S3

Page 94: Alignment of large  genomic sequences

Alignment of sequence duplications

S1

S2

Alternative alignment; probably biologically wrong; lower numerical score!

S3

Page 95: Alignment of large  genomic sequences

Anchored sequence alignment

Biologically meaningful alignment often not possible by automated approaches.

Page 96: Alignment of large  genomic sequences

Anchored sequence alignment

Biologically meaningful alignment not possible by automated approaches.

Idea: use expert knowledge to guide alignment procedure

Page 97: Alignment of large  genomic sequences

Anchored sequence alignment

Biologically meaningful alignment not possible by automated approaches.

Idea: use expert knowledge to guide alignment procedure

User defines a set of anchor points that are to be "respected" by the alignment procedure

Page 98: Alignment of large  genomic sequences

Anchored sequence alignment

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Page 100: Alignment of large  genomic sequences

Anchored sequence alignment

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Use known homology as anchor point

Page 101: Alignment of large  genomic sequences

Anchored sequence alignment

NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Use known homology as anchor point

Page 102: Alignment of large  genomic sequences

Anchored sequence alignment

NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Use known homology as anchor point

Anchor point = anchored fragment (gap-free pair of segments)

Page 103: Alignment of large  genomic sequences

Anchored sequence alignment

NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Use known homology as anchor point

Anchor point = anchored fragment (gap-free pair of segments)

Remainder of sequences aligned automatically

Page 104: Alignment of large  genomic sequences

Anchored sequence alignment

-------NLF VALYDFVASG DNTLSITKGE klrvlgynhn

iihredkGVI YALWDYEPQN DDELPMKEGD cmt-------

Anchored alignment

Page 105: Alignment of large  genomic sequences

Anchored sequence alignment

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

Anchor points in multiple alignment

Page 106: Alignment of large  genomic sequences

Anchored sequence alignment

NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQND DELPMKEGDCMT

GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

Anchor points in multiple alignment

Page 107: Alignment of large  genomic sequences

Anchored sequence alignment

-------NLF V-ALYDFVAS GD-------- NTLSITKGEk lrvLGYNhn

iihredkGVI Y-ALWDYEPQ ND-------- DELPMKEGDC MT-------

-------GYQ YrALYDYKKE REedidlhlg DILTVNKGSL VA-LGFS--

Anchored multiple alignment

Page 108: Alignment of large  genomic sequences

Algorithmic questions

Goal:

Find optimal alignment (= consistent set of fragments) under constraints given by user-specified anchor points!

Page 109: Alignment of large  genomic sequences

Additional input file with anchor points:

1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Algorithmic questions

Page 110: Alignment of large  genomic sequences

Algorithmic questions

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN

IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

Page 112: Alignment of large  genomic sequences

Additional input file with anchor points:

1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Sequences

Algorithmic questions

Page 113: Alignment of large  genomic sequences

Additional input file with anchor points:

1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Sequences start positions

Algorithmic questions

Page 114: Alignment of large  genomic sequences

Additional input file with anchor points:

1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Sequences start positions length

Algorithmic questions

Page 115: Alignment of large  genomic sequences

Additional input file with anchor points:

1 3 215 231 5 4.5

2 3 34 78 23 1.23

1 4 317 402 8 8.5

Sequences start positions length score
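Reading such an anchor file could look like the sketch below. The column order (two sequence numbers, two start positions, length, score) follows the labels on the slide; the `Anchor` class and `parse_anchors` function are hypothetical names, not part of DIALIGN.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Anchor:
    seq1: int      # number of the first sequence
    seq2: int      # number of the second sequence
    start1: int    # start position in the first sequence
    start2: int    # start position in the second sequence
    length: int    # length of the gap-free anchored fragment
    score: float   # user-defined priority score


def parse_anchors(text: str) -> List[Anchor]:
    """Parse whitespace-separated anchor lines; blank lines are skipped."""
    anchors = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"line {line_no}: expected 6 fields, got {len(fields)}")
        seq1, seq2, start1, start2, length = (int(x) for x in fields[:5])
        anchors.append(Anchor(seq1, seq2, start1, start2, length, float(fields[5])))
    return anchors
```

The first line of the example file, `1 3 215 231 5 4.5`, then reads as: position 215 of sequence 1 is anchored to position 231 of sequence 3 over 5 residues, with score 4.5.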

Algorithmic questions

Page 116: Alignment of large  genomic sequences

Algorithmic questions

Requirements:

Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points

Page 117: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 119: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Inconsistent anchor points!

Page 120: Alignment of large  genomic sequences

Algorithmic questions

atctaat---agttaaactcccccgtgcttag

cagtgcgtgtattac-taacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Inconsistent anchor points!

Page 122: Alignment of large  genomic sequences

Algorithmic questions

Requirements:

Anchor points need to be consistent! – if necessary: select consistent subset from user-specified anchor points

Find alignment under constraints given by anchor points!

Page 123: Alignment of large  genomic sequences

Algorithmic questions

Use data structures from multiple alignment

Page 124: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Page 125: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Greedy procedure for multiple alignment

Page 127: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa

Question: which positions are still alignable ?

Page 128: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag Si

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa x

For each position x and each sequence Si there exist an

upper bound ub(x,i) and a lower bound lb(x,i) for

residues y in Si that are alignable with x
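For just two sequences the bounds follow directly from the order of positions, which makes them easy to sketch. The class below keeps lb(x) and ub(x) for every position x of one sequence with respect to the other and updates them when a gap-free fragment is accepted. The class name, the 1-based inclusive convention and the restriction to two sequences are assumptions; the multi-sequence update, which also has to propagate bounds transitively, is not shown.

```python
class PairwiseBounds:
    """lb/ub bookkeeping for positions of sequence A relative to sequence B (1-based)."""

    def __init__(self, len_a: int, len_b: int) -> None:
        self.len_a = len_a
        self.lb = [1] * (len_a + 1)      # index 0 is unused
        self.ub = [len_b] * (len_a + 1)

    def is_consistent(self, a: int, b: int, length: int) -> bool:
        """Can the fragment pairing a..a+length-1 with b..b+length-1 still be added?"""
        if a < 1 or b < 1 or a + length - 1 > self.len_a:
            return False
        return all(self.lb[a + k] <= b + k <= self.ub[a + k] for k in range(length))

    def accept(self, a: int, b: int, length: int) -> None:
        """Record an accepted fragment and tighten the bounds around it."""
        for x in range(1, a):                  # positions left of the fragment
            self.ub[x] = min(self.ub[x], b - 1)
        for k in range(length):                # positions inside the fragment
            self.lb[a + k] = self.ub[a + k] = b + k
        for x in range(a + length, self.len_a + 1):  # positions right of it
            self.lb[x] = max(self.lb[x], b + length)
```

A position x is alignable with y exactly when lb(x) <= y <= ub(x), so each consistency check costs time proportional to the fragment length.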

Page 130: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag Si

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa x

ub(x,i) and lb(x,i) updated during greedy procedure

Page 131: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag Si

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa x

Initial values of lb(x,i), ub(x,i)

Page 134: Alignment of large  genomic sequences

Algorithmic questions

Anchor points treated like fragments in greedy algorithm:

Page 135: Alignment of large  genomic sequences

Algorithmic questions

Anchor points treated like fragments in greedy algorithm:

Sorted according to user-defined scores

Page 136: Alignment of large  genomic sequences

Algorithmic questions

Anchor points treated like fragments in greedy algorithm:

Sorted according to user-defined scores

Accepted if consistent with previously accepted anchors

Page 137: Alignment of large  genomic sequences

Algorithmic questions

Anchor points treated like fragments in greedy algorithm:

Sorted according to user-defined scores

Accepted if consistent with previously accepted anchors

ub(x,i) and lb(x,i) updated during greedy

procedure

Page 138: Alignment of large  genomic sequences

Algorithmic questions

Anchor points treated like fragments in greedy algorithm:

Sorted according to user-defined scores

Accepted if consistent with previously accepted anchors

ub(x,i) and lb(x,i) updated during greedy

procedure

Resulting values of ub(x,i) and lb(x,i) used as initial

values for alignment procedure
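The greedy selection of anchor points can be sketched as follows. Anchors are plain tuples in the order of the anchor file (seq1, seq2, start1, start2, length, score). This is a strong simplification: only anchors between the same pair of sequences are compared, so inconsistencies that arise transitively through a third sequence are not detected. The procedure on the slides handles those through the lb/ub bounds. Function names are hypothetical.

```python
from typing import List, Sequence, Tuple

AnchorTuple = Tuple[int, int, int, int, int, float]


def compatible(a: AnchorTuple, b: AnchorTuple) -> bool:
    """Pairwise check for two gap-free anchors."""
    if (a[0], a[1]) != (b[0], b[1]):
        return True  # simplification: different sequence pairs are not compared
    if a[2] - a[3] == b[2] - b[3]:
        return True  # same diagonal: both anchors pair positions identically
    first, second = sorted((a, b), key=lambda t: t[2])
    # Otherwise the earlier anchor must end before the later one starts in both sequences.
    return (first[2] + first[4] <= second[2]
            and first[3] + first[4] <= second[3])


def select_anchors(anchors: Sequence[AnchorTuple]) -> List[AnchorTuple]:
    """Greedy selection: highest score first, keep an anchor if it fits all kept ones."""
    accepted: List[AnchorTuple] = []
    for candidate in sorted(anchors, key=lambda t: t[5], reverse=True):
        if all(compatible(candidate, kept) for kept in accepted):
            accepted.append(candidate)
    return accepted
```

Because candidates are visited in order of decreasing score, a high-priority anchor can never be displaced by a lower-priority one that contradicts it.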

Page 139: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag Si

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa x

Initial values of lb(x,i), ub(x,i)

Page 140: Alignment of large  genomic sequences

Algorithmic questions

atctaatagttaaactcccccgtgcttag Si

cagtgcgtgtattactaacggttcaatcgcg

caaagagtatcacccctgaattgaataa x

Initial values of lb(x,i), ub(x,i) calculated using anchor

points

Page 141: Alignment of large  genomic sequences

Algorithmic questions

Ranking of anchor points to prioritize anchor points, e.g.

anchor points from verified homologies -- higher priority

automatically created anchor points (using CHAOS, BLAST, … ) -- lower priority

Page 142: Alignment of large  genomic sequences

Application: Hox gene cluster

Page 143: Alignment of large  genomic sequences

Application: Hox gene cluster

Use gene boundaries as anchor points

Page 144: Alignment of large  genomic sequences

Application: Hox gene cluster

Use gene boundaries as anchor points

+ CHAOS / BLAST hits

Page 145: Alignment of large  genomic sequences

Application: Hox gene cluster

                   no anchoring   anchoring

Aligned columns

  2 seq            2958           3674

  3 seq            668            1091

  4 seq            244            195

Score              1166           1007

CPU time           4:22           0:19

Page 146: Alignment of large  genomic sequences

Application: Hox gene cluster

Example:

Teleost Hox gene cluster:

Page 147: Alignment of large  genomic sequences

Application: Hox gene cluster

Example:

Teleost Hox gene cluster:

Score of anchored alignment 15 % higher than score of non-anchored alignment !

Page 148: Alignment of large  genomic sequences

Application: Hox gene cluster

Example:

Teleost Hox gene cluster:

Score of anchored alignment 15 % higher than score of non-anchored alignment !

Conclusion: Greedy optimization algorithm does a bad job!

Page 149: Alignment of large  genomic sequences

Application: Improvement of Alignment programs

Two possible reasons for mis-alignments:

Page 150: Alignment of large  genomic sequences

Application: Improvement of Alignment programs

Two possible reasons for mis-alignments:

Wrong objective function: Biologically correct

alignment gets bad numerical score

Page 151: Alignment of large  genomic sequences

Application: Improvement of Alignment programs

Two possible reasons for mis-alignments:

Wrong objective function: Biologically correct

alignment gets bad numerical score

Bad optimization algorithms: Biologically correct

alignment gets best numerical score, but algorithm

fails to find this alignment

Page 152: Alignment of large  genomic sequences

Application: Improvement of Alignment programs

Two possible reasons for mis-alignments:

Anchored alignments can help to decide

Page 153: Alignment of large  genomic sequences

Application: RNA alignment

Page 154: Alignment of large  genomic sequences

Application: RNA alignment

aa----CCCC AGC---GUAa gucgcuaucc a

cacucuCCCA AGC---GGAG Aac------- -

ccg----CCA AaagauGGCG Acuuga---- -

non-anchored alignment

Page 155: Alignment of large  genomic sequences

Application: RNA alignment

aa----CCCC AGC---GUAa gucgcuaucc a

cacucuCCCA AGC---GGAG Aac------- -

ccg----CCA AaagauGGCG Acuuga---- -

structural motif mis-aligned

Page 156: Alignment of large  genomic sequences

Application: RNA alignment

aaCCCCAGCG UAAGUCGCUA UCca--

--CACUCUCC CAAGCGGAGA AC----

----CCGCCA AAAGAUGGCG ACuuga

3 conserved nucleotides as anchor points

Page 157: Alignment of large  genomic sequences

WWW interface at GOBICS (Göttingen Bioinformatics Compute Server)
