On the Complexity of Minimum Path Cover with Subpath Constraints for Multi-Assembly

Romeo Rizzi (1,*), Alexandru I. Tomescu (2,*), Veli Mäkinen (2)

(1) Department of Computer Science, University of Verona, Italy
(2) Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Finland
(*) Equal contribution

RECOMB-Seq 2014, 31 March 2014

Outline: Multi-assembly; Long reads; Paired-end reads

MULTI-ASSEMBLY

Assembly of fragments from different, but related, sequences:
- transcriptomics (RNA-Seq)
- viral quasi-species
- metagenomics

Assumptions:
- an existing reference is available (genome-guided multi-assembly)
- no existing annotation is used (annotation-free)

OVERLAP AND SPLICING GRAPHS

Overlap graphs:
- reads ≡ nodes
- overlaps ≡ arcs
- plus coverage information

Splicing graphs:
- exons ≡ nodes
- reads overlapping two exons ≡ arcs
- plus coverage information

Existing reference ⟹ graphs are acyclic (DAGs)

MINIMUM PATH COVER (MPC)

What is the minimum number of paths required to cover all nodes of a DAG?

- RNA-Seq: Cufflinks, CLASS, BRANCH
- Viral quasi-species: ShoRAH

MINIMUM PATH COVER (MPC)

In general it is NP-complete (one path suffices iff G has a Hamiltonian path).

But it is solvable in polynomial time on DAGs:

- Dilworth's theorem (1950) + Fulkerson's constructive proof (1956)
- by a maximum matching algorithm, solvable in time $O(t(G)\sqrt{n})$
- the weighted version can be solved in time $O(n^2 \log n + t(G)\,n)$

where $t(G)$ is the number of arcs in the transitive closure of G.
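
The matching-based approach can be sketched as follows. This is a minimal illustration, not the authors' implementation: it builds the transitive closure by a search from every node and finds a maximum bipartite matching with simple augmenting paths (Kuhn's algorithm) rather than a faster matching algorithm, so it does not reach the $O(t(G)\sqrt{n})$ bound. The function names and the node labelling 0..n-1 are assumptions made for the example.

```python
from collections import defaultdict


def transitive_closure(n, arcs):
    """Return reach[u] = set of nodes reachable from u by a non-empty path."""
    succ = defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
    reach = []
    for s in range(n):
        seen, stack = set(), list(succ[s])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ[v])
        reach.append(seen)
    return reach


def min_path_cover_size(n, arcs):
    """Minimum number of paths covering all nodes 0..n-1 of a DAG.

    Paths may share nodes, which is why the matching is computed on the
    transitive closure and not on the DAG itself.
    """
    reach = transitive_closure(n, arcs)
    match_of = [None] * n  # match_of[v] = u: closure arc (u, v) is matched

    def augment(u, visited):
        # Try to match u, re-matching earlier nodes along an augmenting path.
        for v in reach[u]:
            if v in visited:
                continue
            visited.add(v)
            if match_of[v] is None or augment(match_of[v], visited):
                match_of[v] = u
                return True
        return False

    matching = sum(augment(u, set()) for u in range(n))
    # Each matched closure arc glues two chains together.
    return n - matching
```

For the DAG with arcs 0→2, 1→2, 2→3, 2→4, `min_path_cover_size(5, [(0, 2), (1, 2), (2, 3), (2, 4)])` returns 2, corresponding to the paths 0, 2, 3 and 1, 2, 4.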

MIN-COST MPC VIA MIN-COST FLOWS

- Unweighted case: MPC via Min-Flows [Pijls, Potharst, 2013]
- Weighted case: MPC via Min-cost Flows

Assuming we know the minimum size of a path cover:

[Figure: example DAG drawn as a flow network, with demands ≥ 1]

MPC VIA MIN-COST FLOWS

This flow problem can be reduced to a Min-cost circulation problem:

- we add an arc from t to s with 'large' cost
- we have only demands (= 1)
- it can be solved in time $O(n^2 \log n + nm)$ by [Gabow and Tarjan, 1991]

This is never worse than $O(n^2 \log n + n\,t(G))$, because $m \le t(G) \le n^2$:

- as soon as there is a path of length $\Theta(n)$, we have $t(G) = \Theta(n^2)$
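
The gap between m and t(G) is easiest to see on a DAG that is a single path. The sketch below counts both quantities for a path on 100 nodes; `closure_size` is a helper name introduced here for illustration and uses a plain search from every node.

```python
def closure_size(n, arcs):
    """Number of arcs t(G) in the transitive closure of a DAG on nodes 0..n-1."""
    succ = {u: [] for u in range(n)}
    for u, v in arcs:
        succ[u].append(v)
    total = 0
    for s in range(n):
        seen, stack = set(), list(succ[s])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ[v])
        total += len(seen)  # closure arcs leaving s
    return total


n = 100
path_arcs = [(i, i + 1) for i in range(n - 1)]  # the path 0 -> 1 -> ... -> 99
m = len(path_arcs)                # 99 arcs in G
t = closure_size(n, path_arcs)    # 4950 = n(n-1)/2 arcs in the closure
```

Here m grows linearly in n while t(G) grows quadratically, which is the case where the bound in terms of m pays off most.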

MIN-COST MPC WITH SUBPATH CONSTRAINTS

INPUT: A DAG G and

1. A superset S of the sources of G, and a superset T of the sinks of G
2. A cost $w(e)$ for each $e \in E(G)$
3. A family $\mathcal{P}^{in} = \{P^{in}_1, \dots, P^{in}_t\}$ of directed paths in G

TASK: Find a minimum number k of directed paths $P^{sol}_1, \dots, P^{sol}_k$ in G such that

1. Every node in V(G) occurs in some $P^{sol}_i$
2. Every path $P^{in} \in \mathcal{P}^{in}$ is entirely contained in some $P^{sol}_i$
3. Every path $P^{sol}_i$ starts in a node of S and ends in a node of T
4. $\sum_{i=1}^{k} \sum_{\text{edge } e \in P^{sol}_i} w(e)$ is minimum among all tuples of k paths satisfying 1.-3.

- introduced by [Bao, Jiang, Girke, 2013, BRANCH], but the case of overlapping constraints was not solved
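
To make conditions 1.-3. and the cost in condition 4. concrete, here is a small checker for a candidate solution. It is only a sketch under assumed conventions that are not in the slides: paths are non-empty lists of nodes, the arcs of G are the keys of the `cost` dictionary, and the function names are hypothetical. It verifies feasibility and reports the total cost; it does not test minimality of k or of the cost.

```python
def contains_subpath(path, sub):
    """True if `sub` occurs as a contiguous run of nodes inside `path`."""
    k = len(sub)
    return any(path[i:i + k] == sub for i in range(len(path) - k + 1))


def check_solution(nodes, cost, S, T, constraints, solution):
    """Return the total cost if `solution` satisfies conditions 1-3, else None.

    cost maps each arc (u, v) of G to w(u, v); paths are non-empty node lists.
    """
    # Every solution path must follow arcs of G.
    for p in solution:
        if any((u, v) not in cost for u, v in zip(p, p[1:])):
            return None
    # Condition 1: every node of G occurs in some solution path.
    covered = {v for p in solution for v in p}
    if not set(nodes) <= covered:
        return None
    # Condition 2: every constraint is contained in some solution path.
    if not all(any(contains_subpath(p, c) for p in solution)
               for c in constraints):
        return None
    # Condition 3: paths start in S and end in T.
    if not all(p[0] in S and p[-1] in T for p in solution):
        return None
    # Quantity minimised in condition 4.
    return sum(cost[(u, v)] for p in solution for u, v in zip(p, p[1:]))


# Toy instance: arcs 0->2, 1->2, 2->3, 2->4, all of cost 1.
cost = {(0, 2): 1, (1, 2): 1, (2, 3): 1, (2, 4): 1}
good = check_solution(range(5), cost, {0, 1}, {3, 4},
                      [[1, 2, 4]], [[0, 2, 3], [1, 2, 4]])
bad = check_solution(range(5), cost, {0, 1}, {3, 4},
                     [[1, 2, 4]], [[0, 2, 4], [1, 2, 3]])
```

In the toy instance the first candidate is feasible with total cost 4, while the second covers all nodes but breaks the constraint 1, 2, 4 and is rejected.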

Subpath constraints as arc demands:

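[Figure: a subpath constraint represented by an arc with demand ≥ 1; the arcs of the constraint path have demand ≥ 0]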

Problem 1: a constraint P included in another constraint Q

[Figure: two constraint arcs with demand ≥ 1, one constraint lying inside the other; path arcs have demand ≥ 0]

- Remove P
- Can be implemented in time $O(N)$ with a suffix tree for large alphabets [Farach, 1997]
- N = sum of lengths of Subpath Constraints
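
The effect of this step can be shown with a deliberately naive sketch. It compares every pair of constraints directly, so it runs in time polynomial in N rather than the $O(N)$ obtained with a suffix tree; the function names and the representation of constraints as lists of nodes are assumptions for the example.

```python
def is_contained(p, q):
    """True if path p occurs as a contiguous run of nodes inside path q."""
    k = len(p)
    return any(q[i:i + k] == p for i in range(len(q) - k + 1))


def remove_contained(constraints):
    """Drop duplicates and every constraint that lies inside another one."""
    unique = []
    for c in constraints:
        if c not in unique:
            unique.append(c)
    return [p for p in unique
            if not any(p != q and is_contained(p, q) for q in unique)]
```

For example, `remove_contained([[1, 2, 3, 4], [2, 3], [5, 6], [2, 3]])` keeps only `[1, 2, 3, 4]` and `[5, 6]`, since any solution path containing 1, 2, 3, 4 also contains 2, 3.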

Problem 2: Suffix-prefix overlaps

[Figure: two constraint arcs with demand ≥ 1 whose paths share a suffix-prefix overlap; path arcs have demand ≥ 0]

- Iteratively merge constraints with longest suffix-prefix overlap
- All suffix-prefix overlaps can be found in optimal time $O(N + \text{overlaps})$ by [Gusfield, Landau and Schieber, 1992]
- Our iterative merging also takes $O(N + \text{overlaps})$ time
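
A brute-force sketch of the merging idea is given below. It is not the $O(N + \text{overlaps})$ procedure referred to above: it recomputes all pairwise overlaps after every merge. It assumes that constraints are lists of nodes and that contained constraints have already been removed; the function names are hypothetical.

```python
def overlap(p, q):
    """Length of the longest proper suffix of p that is also a prefix of q."""
    for k in range(min(len(p), len(q)) - 1, 0, -1):
        if p[-k:] == q[:k]:
            return k
    return 0


def merge_overlapping(constraints):
    """Repeatedly merge the pair with the longest suffix-prefix overlap."""
    paths = [list(c) for c in constraints]
    while True:
        best = (0, None, None)  # (overlap length, index of p, index of q)
        for i, p in enumerate(paths):
            for j, q in enumerate(paths):
                if i != j:
                    k = overlap(p, q)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return paths  # no two remaining constraints overlap
        merged = paths[i] + paths[j][k:]  # glue q onto p, overlap kept once
        paths = [p for idx, p in enumerate(paths) if idx not in (i, j)]
        paths.append(merged)
```

On the input `[[1, 2, 3], [2, 3, 4], [3, 4, 5]]` the sketch first merges the first two constraints (overlap 2, 3) and then the result with the third, leaving the single constraint `[1, 2, 3, 4, 5]`.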

Pre-processing phase:
- $O(N + c^2)$, where c is the number of Subpath Constraints
- overlaps $\le c^2$

The flow problem can be reduced to a Min-cost circulation problem:
- we add an arc from t to s with 'large' cost
- $O(n)$ nodes and $O(m + c)$ arcs
- only demands (= 1)

Min-cost MPC with Subpath Constraints can be solved in time $O(N + c^2 + n^2 \log n + n(m + c))$ by [Gabow and Tarjan, 1991]

MPC WITH PAIRED SUBPATH CONSTRAINTS

INPUT: A DAG G and

1. A family $\mathcal{P}^{in} = \{(P^{in}_{1,1}, P^{in}_{1,2}), \dots, (P^{in}_{t,1}, P^{in}_{t,2})\}$ of pairs of directed paths in G

TASK: Find a minimum number k of directed paths $P^{sol}_1, \dots, P^{sol}_k$ in G such that

1. Every node in V(G) occurs in some $P^{sol}_i$
2. For every pair $(P^{in}_{j,1}, P^{in}_{j,2}) \in \mathcal{P}^{in}$, there exists $P^{sol}_i$ such that both $P^{in}_{j,1}$ and $P^{in}_{j,2}$ are entirely contained in $P^{sol}_i$

- introduced by [Song and Florea, 2013, CLASS]
- we show that it is
  - NP-hard; not FPT when parameterized by k
  - FPT in the number of constraints and nodes that need to be covered
- solved in parallel by [Beerenwinkel, Beretta, Bonizzoni, Dondi and Pirola, 2014]

GENE AKT3 - ANNOTATION AND CUFFLINKS

[Figure: splicing graph of gene AKT3 (nodes 0 to 27), drawn twice: the annotated transcripts and the transcripts reported by Cufflinks]

GENE AKT3 - ANNOTATION AND SUBPATHS

[Figure: splicing graph of gene AKT3 (nodes 0 to 27), drawn twice: the annotated transcripts and the subpath constraints]

GENE AKT3 - ANNOTATION AND MERGED SUBPATHS

[Figure: splicing graph of gene AKT3 (nodes 0 to 27), drawn twice: the annotated transcripts and the merged subpath constraints]

GENE AKT3 - ANNOTATION AND MPC-SC

[Figure: splicing graph of gene AKT3 (nodes 0 to 27), drawn twice: the annotated transcripts and the MPC-SC solution]

CONCLUSIONS

- Min-cost Minimum Path Cover: $O(n^2 \log n + nm)$
- Min-cost Minimum Path Cover with Subpath Constraints: $O(N + c^2 + n^2 \log n + n(m + c))$
  - c = number of Subpath Constraints
  - N = sum of lengths of Subpath Constraints
- Minimum Path Cover with Pairs of Subpath Constraints: NP-hard, but FPT in the total number of constraints
- Future work: a better integration of observed coverages
- Implementation for RNA-Seq reads under way

ACKNOWLEDGEMENTS

Partial support by
- Academy of Finland, Centre of Excellence in Cancer Genomics Research (grant 250345)
- Finnish Cultural Foundation

Romeo Rizzi, Veli Mäkinen

Thanks to
- Anna Kuosmanen and Ahmed Sobih for preliminary implementation and experiments

Thank you!

PICTORIAL PROOF OF STEP 2

LEMMA

Step 2 does not increase the cardinality of the solution path cover.

[Figure: three solution paths $P_i$, $P_j$, $P_k$ through the nodes $u_i, v_i, u_j, v_j, u_k, v_k$, shown before and after the transformation (⟹)]

NP-COMPLETENESS OF PROBLEM MPC-PSC

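[Figure: DAG built from a graph G with nodes $v_1, \dots, v_n$ and edges $e_k = v_{i_k} v_{j_k}$ ($k = 1, \dots, m$); its main path has nodes $0, 1, \dots, n+m$, with labels $[v_{i_k}]$ and $[v_{j_k}]$ attached for each edge $e_k$]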

THEOREM

Problem MPC-PSC is NP-complete.

- A graph $G = (\{v_1, \dots, v_n\}, \{e_1, \dots, e_m\})$ has chromatic number 3 iff the DAG above admits a solution with 3 paths.

COROLLARY

For no $\varepsilon > 0$ does there exist a $(\frac{4}{3} - \varepsilon)$-approximation algorithm for Problem MPC-PSC unless P=NP. Moreover, the problem is not FPT when parameterized on OPT (the minimum number of paths in a solution).

PROBLEM MPC-PSC IS FPT IN THE TOTAL NUMBER OF CONSTRAINTS

LEMMA

Let C be a set of constraints on a DAG G. There exists a directed path P in G which satisfies all constraints in C iff any two constraints in C are compatible.

THEOREM

Given an instance for Problem MPC-PSC, we can decide in polynomial time if OPT = 2, and if so, find the two solution paths. Moreover, Problem MPC-PSC is fixed-parameter tractable (FPT) in the total number C of input constraints.

- construct the 'incompatibility' graph; this is bipartite iff OPT = 2
- partition the set of constraints in all possible ways and check that all constraints in every class are pairwise compatible
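
The bipartiteness test on the incompatibility graph can be sketched as a breadth-first 2-colouring. The slides do not define when two constraints are compatible, so the predicate `compatible` is passed in as a parameter and is assumed to be symmetric; the function name is also hypothetical. The sketch only splits the constraints into two pairwise-compatible classes and does not construct the two solution paths.

```python
from collections import deque


def two_class_partition(constraints, compatible):
    """2-colour the incompatibility graph; return (class0, class1) or None."""
    n = len(constraints)
    # Arc i-j whenever constraints i and j are NOT compatible.
    adj = [[j for j in range(n)
            if j != i and not compatible(constraints[i], constraints[j])]
           for i in range(n)]
    colour = [None] * n
    for start in range(n):
        if colour[start] is not None:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if colour[v] is None:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return None  # odd cycle: the graph is not bipartite
    return ([c for c, k in zip(constraints, colour) if k == 0],
            [c for c, k in zip(constraints, colour) if k == 1])
```

With a toy predicate such as `lambda a, b: a % 2 == b % 2` on the stand-in constraints `[1, 2, 3, 4]`, the call returns the classes `[1, 3]` and `[2, 4]`; if three constraints are pairwise incompatible, it returns `None`.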
