
Master Course

Date post: 26-Jan-2016
Description:
Master Course. MSc Bioinformatics for Health Sciences. H15: Algorithms on strings and sequences. Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen), Dep. de Llenguatges i Sistemes Informàtics, CEPBA-IBM Research Institute, Universitat Politècnica de Catalunya.
Transcript
Page 1: Master Course

Master Course

MSc Bioinformatics for Health Sciences

H15: Algorithms on strings and sequences

Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

Dep. de Llenguatges i Sistemes Informàtics

CEPBA-IBM Research Institute

Universitat Politècnica de Catalunya

Page 2: Master Course

Master Course

Fourth lecture:

Sequence assembly

Page 3: Master Course

Sequence assembly

It is applied to the following topics:

• EST assembly

• DNA sequencing

Page 4: Master Course

DNA sequencing

There are two techniques:

• Hybridization: provides information about the l-tuples (l-mers) present in the DNA.

• Shotgun: the DNA sequence is broken into random fragments of 100 Kb-500 Kb.

Page 5: Master Course

DNA sequencing

There are two techniques:

• Hybridization: provides information about the l-mers present in the DNA.

• Shotgun: the DNA sequence is broken into random fragments of 100 Kb-500 Kb.

Page 6: Master Course

Hybridization

Let xxxxxxxxxxxxx be the sequence we want to know,

and the hybridization technique gives us the set of 3-mers that belong to it:

AAC GAT TGC ACG CGG GCC TTG GGA ATT

How can the sequence be reconstructed?

Page 7: Master Course

Hybridization

Given the 3-mers of the sequence:

AAC GAT TGC ACG CGG GCC TTG GGA ATT

As AAC and ACG belong to the sequence,

then AACG belongs to the sequence,

because the longest (proper) suffix of AAC matches the longest (proper) prefix of ACG.

This relation can be represented with a directed graph: AAC → ACG
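The suffix-prefix relation can be sketched in a few lines of Python. This is a minimal illustration, assuming the nine 3-mers listed on the slide; the function name `overlap_graph` is not from the course material.

```python
def overlap_graph(kmers):
    """Map each k-mer a to the k-mers b whose longest proper prefix
    equals the longest proper suffix of a (directed edge a -> b)."""
    return {a: [b for b in kmers if a != b and a[1:] == b[:-1]] for a in kmers}


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
graph = overlap_graph(kmers)

for kmer, successors in graph.items():
    print(kmer, "->", successors)
```

In this small example every 3-mer has at most one successor, so the graph is a single chain.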

Page 8: Master Course

Hybridization

Construction of the complete suffix-prefix graph

AAC GAT TGC

ACG CGG GCC TTG

GGA ATT

that gives us the unknown sequence:

AACGGATTGCC

But is this a realistic case?

Page 9: Master Course

Hybridization

Let us introduce a more realistic case:

AAC CAA GAT TGC

ACG CGG GCC TTG

GGC GGA CCG ATT

Now the sequence is given by a Hamiltonian path,

that is, a path that traverses every node exactly once,

and finding such a path is an NP-complete problem!

What is the cost of the hybridization method?
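A Hamiltonian path can be searched for by exhaustive backtracking, which is what makes the approach expensive in general. The sketch below is only an illustration: it assumes the simpler set of nine 3-mers from the earlier slide, and the names `hamiltonian_path` and `spell` are hypothetical.

```python
def hamiltonian_path(graph):
    """Return a path visiting every node exactly once, or None.
    Exhaustive backtracking: exponential time in the worst case."""
    nodes = list(graph)

    def extend(path, visited):
        if len(path) == len(nodes):
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in nodes:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Merge consecutive k-mers that overlap in k-1 characters."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
graph = {a: [b for b in kmers if a != b and a[1:] == b[:-1]] for a in kmers}
print(spell(hamiltonian_path(graph)))  # AACGGATTGCC
```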

Page 10: Master Course

Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ possible l-mers of length $L$ that have to be generated.

2. Searching for the suffix-prefix matches:

If there are $m$ l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Hamiltonian path:

NP-complete

Page 11: Master Course

Excursus: cost

Linear cost: $O(m)$

m: t = 1 ms; 10m: 10t = 10 ms; 1000m: 1000t = 1 s

Quadratic cost: $O(m^2)$

m: t = 1 ms; 10m: 100t = 100 ms; 1000m: 1000000t ≈ 16 min

Exponential cost: $O(2^m)$

m: t = 1 ms; 10m: $2^{10}\,t \approx 1$ s; 1000m: $2^{1000}\,t \approx 10^{301}\,t$, far more than $10^{18}$ years
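The linear and quadratic rows can be checked with a little arithmetic. A minimal sketch, assuming a base time of 1 ms for input size m; the helper `scaled_time_ms` is illustrative only.

```python
def scaled_time_ms(base_ms, factor, exponent):
    """Running time after the input grows by `factor`,
    for a cost that grows as size**exponent."""
    return base_ms * factor ** exponent


for factor in (10, 1000):
    linear = scaled_time_ms(1, factor, 1)
    quadratic = scaled_time_ms(1, factor, 2)
    print(f"x{factor}: linear {linear} ms, quadratic {quadratic} ms")

# 1000 times more input under quadratic cost, expressed in minutes
print(scaled_time_ms(1, 1000, 2) / 1000 / 60)

# an exponential factor of 2**10, in milliseconds (about one second)
print(2 ** 10)
```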

Page 12: Master Course

Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ possible l-mers of length $L$ that have to be generated.

2. Searching for the suffix-prefix matches:

If there are $m$ l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Hamiltonian path:

NP-complete

How can the NP-completeness be avoided?

Page 13: Master Course

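Hybridization:

Search for the Hamiltonian path (NP-complete) in the graph whose nodes are the 3-mers:

AAC GAT TGC

ACG CGG GCC TTG

GGC GGA CCG ATT

or search for the Eulerian path (linear) in the graph whose nodes are the 2-mers and whose edges are the 3-mers:

AA AC GG CG GA CC GC TG TT AT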

Page 14: Master Course

Hybridization: Eulerian path

Search for the Eulerian path of the graph:

Balanced nodes: indegree = outdegree (traversed nodes)

Unbalanced nodes: indegree ≠ outdegree (starting or ending nodes)

Page 15: Master Course

Hybridization: Eulerian path

Algorithm:

1. Construct a random path between the starting and ending nodes.

2. Add cycles from balanced nodes while possible.

Page 16: Master Course

Hybridization: Eulerian path

Algorithm:

1. Construct a random path between the starting and ending nodes.

2. Add cycles from balanced nodes while possible.
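The two steps above correspond closely to Hierholzer's algorithm. The sketch below uses the standard stack-based variant rather than the exact procedure on the slide. It assumes that each l-mer is an edge between its (l-1)-mer prefix and suffix and that an Eulerian path exists (it does not check this). The name `eulerian_path` is illustrative.

```python
from collections import defaultdict


def eulerian_path(kmers):
    """Spell a sequence that uses every k-mer exactly once as an edge.
    Assumes an Eulerian path exists; runs in time linear in the edges."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # outdegree minus indegree
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        out_edges[left].append(right)
        balance[left] += 1
        balance[right] -= 1

    # The starting node is the unbalanced node with one extra outgoing edge;
    # if all nodes are balanced, any node with an edge will do.
    start = next((n for n, b in balance.items() if b == 1), kmers[0][:-1])

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # keep walking unused edges
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = ["AAC", "GAT", "TGC", "ACG", "CGG", "GCC", "TTG", "GGA", "ATT"]
print(eulerian_path(kmers))  # AACGGATTGCC
```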

Page 17: Master Course

Hybridization: cost

Cost:

1. Finding the l-mers AAC, CAA, ACG, ...:

There are $4^L$ possible l-mers of length $L$ that have to be generated.

2. Searching for the suffix-prefix matches:

If there are $m$ l-mers, then there are $O(m^2 L^2)$ comparisons.

3. Searching for the Eulerian path:

Linear cost

Now, what is the limiting factor?

Page 18: Master Course

Hybridization: limiting factor

Repeated l-mers:

Given the graph:

AAC CAA GAT TGC

ACG CGG GCC TTG

GGA ATT GAC

How many sequences can be assembled?

CAACGGATTGCC

CAACGGACGGATTGCC

What is the probability of a repeat?

Page 19: Master Course

Hybridization: statistical model

Model: a random sequence of length N in which each base occurs independently with probability 1/4.

How can the probability of a repeat be computed?

Given 2 l-mers, the probability that they match is: $4^{-L}$

Given 3 l-mers, the expected number of matching pairs is: $\binom{3}{2} 4^{-L}$

Given m l-mers, the expected number of matching pairs is: $\binom{m}{2} 4^{-L}$

If $\binom{m}{2} 4^{-L} < 1$ then $m < \sqrt{2 \cdot 4^L}$; for $L = 8$ this gives $m \le 362$!

Conclusion: this technique can be applied only to short sequences.
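The bound can be checked numerically. A minimal sketch under the random-sequence model stated above; the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb, sqrt


def expected_repeats(m, L):
    """Expected number of matching pairs among m random l-mers of length L."""
    return comb(m, 2) / 4 ** L


def max_repeat_free(L):
    """Largest m for which the expected number of matching pairs stays below 1."""
    m = 2
    while expected_repeats(m + 1, L) < 1:
        m += 1
    return m


L = 8
print(max_repeat_free(L))   # largest m by direct search
print(sqrt(2 * 4 ** L))     # closed-form approximation of the same bound
```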

Page 20: Master Course

Hybridization:

Are genome sequences close to random sequences?

Connect to

http://alggen.lsi.upc.edu

and follow the links RESEARCH → SEARCH → MREPATT

Page 21: Master Course
Page 22: Master Course

DNA sequencing

There are two techniques:

• Hybridization: provides information about the l-mers present in the DNA.

• Shotgun: the DNA sequence is broken into random fragments of 100 Kb-500 Kb.

Page 23: Master Course

Shotgun

With the unknown sequence xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

It is possible:

• to make some copies

• to break it into random and unsorted short segments

What can we do?

Page 24: Master Course

Shotgun: algorithm

Assume xxxxx|xxxxxxx|xxxxxxx|xxxx xxxxxxxx|xxxxxx|xxxxxx|xxxxxxx|xxxxxx|xxxxxx|xxxxxxx

The algorithm is:

1st. Compare all pairs of segments, searching for approximate suffix-prefix matches.

2nd. Construct the suffix-prefix graph.

3rd. Find the path.

Page 25: Master Course

Shotgun

Given the three copies

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

shotgun sequencing breaks them into the following segments:

accgt, aggt, acgatac, accttta, tttaac, gataca, accgtacc, ggt, acaggt, taacgat, accg, tacctt

Page 26: Master Course

Shotgun

The pairwise comparison that searches for approximate suffix-prefix matches can be done with:

• Dynamic programming (quadratic cost)

• Two steps:

  • Find the pairs suspected to overlap (linear cost with the hash algorithm)

  • Assemble them with dynamic programming.
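The two-step idea can be sketched as follows. This is a simplification, not the course's implementation. The first step indexes every k-mer of every segment in a hash table to find candidate pairs. The second step here uses an exact suffix-prefix comparison in place of approximate dynamic programming. The names `candidate_pairs` and `overlap`, and the choice k = 3, are assumptions.

```python
from collections import defaultdict


def candidate_pairs(reads, k):
    """Ordered pairs (i, j), i != j, of reads that share at least one k-mer.
    Building the hash index is linear in the total length of the reads."""
    index = defaultdict(set)
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            index[read[p:p + k]].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update((i, j) for i in ids for j in ids if i != j)
    return pairs


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b
    (exact match, at least min_len long), or 0 if there is none."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


reads = ["accgt", "aggt", "acgatac", "accttta", "tttaac", "gataca",
         "accgtacc", "ggt", "acaggt", "taacgat", "accg", "tacctt"]

for i, j in sorted(candidate_pairs(reads, 3)):
    n = overlap(reads[i], reads[j], 3)
    if n:
        print(reads[i], "->", reads[j], "overlap", n)
```

Only the candidate pairs reach the more expensive comparison, which is the point of the filtering step.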

Page 27: Master Course

Shotgun

Given the suffix-prefix graph with nodes

accgtacc, accttta, tacctt, tttaac, taacgat, acgatac, accg, accgt, acaggt, gataca

the path gives the sequence

accgtacctttaacgatacaggt

but the Hamiltonian path has exponential cost!

Page 28: Master Course

Shotgun:

New problems arise:

(Figure: segments aligned along the unknown sequence, including the fragment accgaccgt.)

• Consecutive repeats

• Lack of coverage

• …

Page 29: Master Course

Shotgun: properties of the coverage

Given the coverage:

Some questions arise:

• What is the mean length of the contigs?

• How many contigs should we expect?

• What is the percentage of coverage?

Page 30: Master Course

Shotgun: percentage of coverage

Given the model: a sequence of length $L$ covered by $N$ segments of length $d$.

Degree of coverage: $N d / L$

We assume that segments are randomly distributed.

The probability that a base is covered by $k$ segments is given by the binomial distribution $B(N, d/L)$:

$\mathrm{Prob}\{X=k\} = \binom{N}{k} (d/L)^k (1-d/L)^{N-k}$

Page 31: Master Course

Shotgun: percentage of coverage

What is the limit of the binomial distribution when $n \to \infty$ and $p \to 0$, having $np = \lambda$?

The Poisson distribution $P(\lambda)$:

$\mathrm{Prob}\{X=k\} = e^{-\lambda} \lambda^k / k!$

Then the probability that at least one segment covers a base is

$\mathrm{Prob}\{X>0\} = 1 - \mathrm{Prob}\{X=0\} = 1 - e^{-\lambda} = 1 - e^{-N d / L}$

Then, with $N d / L = 4.6$ we obtain 99% coverage

and with $N d / L = 6.9$ we obtain 99.9% coverage.
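The coverage figures can be reproduced directly from the formula. A minimal sketch: the values of N, d and L below are hypothetical, chosen only so that N d / L = 4.6, and the function name is illustrative.

```python
from math import exp


def covered_fraction(N, d, L):
    """Probability that a base is covered by at least one segment,
    under the Poisson approximation and under the exact binomial model."""
    lam = N * d / L
    poisson = 1 - exp(-lam)
    binomial = 1 - (1 - d / L) ** N
    return poisson, binomial


for degree in (4.6, 6.9):
    print(degree, 1 - exp(-degree))

# Hypothetical example: 4600 segments of length 500 over a 500000-base sequence
print(covered_fraction(4600, 500, 500000))
```

For such parameters the two models agree closely, which is why the Poisson limit is used.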

Page 32: Master Course

Assembly of ESTs

It is the same procedure as shotgun sequencing…

…but with one great advantage:

there are many graphs with a small number of nodes!

Connect to

http://alggen.lsi.upc.es

and follow the links RESEARCH → ESSEM

