2015 vancouver-vanbug

Date post: 16-Jul-2015
Upload: ctitusbrown
Page 1: 2015 vancouver-vanbug

Building a platform for bioinformatics: some exciting

new directions for khmer.

C. Titus Brown

[email protected]

March 12, 2015

Page 2: 2015 vancouver-vanbug

Hello!

Associate Professor (#tenure!);

School of Veterinary Medicine

University of California, Davis.

More information at:

• ged.msu.edu/ (URL needs to be updated :)

• github.com/ged-lab/

• ivory.idyll.org/blog/

• @ctitusbrown

Page 3: 2015 vancouver-vanbug

Warnings

This talk contains information that may constitute

“forward-looking statements.” Generally, the words

“believe,” “expect,” “intend,” “estimate,”

“anticipate,” “project,” “will” and similar expressions

identify forward-looking statements, which generally

are not historical in nature.

I have been advised to put this disclaimer in as well:

Dr. Brown is not currently under treatment for any

disorders related to megalomania.

Page 4: 2015 vancouver-vanbug

Introducing k-mers

CCGATTGCACTGGACCGA (<- read)

CCGATTGCAC

CGATTGCACT

GATTGCACTG

ATTGCACTGG

TTGCACTGGA

TGCACTGGAC

GCACTGGACC

CACTGGACCG

ACTGGACCGA
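
A minimal sketch of the decomposition shown on this slide, using the read above and k = 10. The function name `kmers` is an illustrative choice, not part of khmer.

```python
def kmers(read, k):
    """Return every overlapping substring of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"

# Print each k-mer indented by its offset, so the overlaps line up under the read.
print(read)
for offset, kmer in enumerate(kmers(read, 10)):
    print(" " * offset + kmer)
```

A read of length L yields L - k + 1 k-mers, so this 18-base read yields nine 10-mers.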

Page 5: 2015 vancouver-vanbug

De Bruijn graphs: assemble on overlaps

J.R. Miller et al. / Genomics (2010)

Page 6: 2015 vancouver-vanbug

K-mers give you an implicit alignment

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTGGACCGATGCACGGTACCG

Page 7: 2015 vancouver-vanbug

K-mers give you an implicit alignment

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTGGACCGATGCACGGTACCG

CATGGACCGATTGCACTGGACCGATGCACGGACCG

(with no accounting for mismatches or indels)

Page 8: 2015 vancouver-vanbug

The problem with k-mers

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTCGACCGATGCACGGTACCG

Each sequencing error results in k novel k-mers!
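
The effect can be checked directly. The sketch below takes the first sequence from this slide, introduces a single substitution at an arbitrarily chosen interior position (position 18, an assumption for illustration), and counts the k-mers that do not occur in the original. An error closer to a read end would create fewer than k novel k-mers, because fewer k-mers overlap it.

```python
def kmer_set(seq, k):
    """Set of all distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


K = 10
true_seq = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"

# Simulate one substitution error (T -> A) in the interior of the read.
error_pos = 18
read_with_error = true_seq[:error_pos] + "A" + true_seq[error_pos + 1:]

# Every k-mer overlapping the error is new to the graph.
novel = kmer_set(read_with_error, K) - kmer_set(true_seq, K)
print(len(novel), "novel k-mers from one error")
```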

Page 9: 2015 vancouver-vanbug

The opportunity:

CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC

CATGGACCGATTGCACTCGACCGATGCACGGTACCG

The graph contains information about errors (can be used for error trimming in reads).

The graph also contains information about variants (can be used for variant calling).

Page 10: 2015 vancouver-vanbug

Conway T C , Bromage A J Bioinformatics 2011;27:479-486

One big challenge: scalability!

De Bruijn graph size scales with # errors.

Page 11: 2015 vancouver-vanbug

One big challenge: scalability!

De Bruijn graph size scales with # errors.

Memory usage ~ “real” variation + number of errors

Number of errors ~ size of data set

Page 12: 2015 vancouver-vanbug

Conway T C , Bromage A J Bioinformatics 2011;27:479-486

One big challenge: scalability!

De Bruijn graph size scales with # errors.

Page 13: 2015 vancouver-vanbug

Goals

• Initial goal: can we assemble large data sets??

• Longer-term goal: can we find efficient (De Bruijn?)

graph-based approaches to sequence analysis?

Page 14: 2015 vancouver-vanbug

First attempt: compressible De Bruijn graphs

[Figure panels: 1%, 5%, 10%, 15%]

Pell et al., 2012

Can use Bloom filters to store

De Bruijn graph structures.

=> Overall structure

remains as you squish graphs

down.
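
A minimal sketch of the idea, not khmer's implementation: k-mers are stored in a Bloom filter, and graph edges are never stored at all. Neighbors are found by asking the filter about each of the four possible one-base extensions. The class and function names, the SHA-256 based hashing, and the default sizes are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; membership tests may give false positives, never false negatives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting the item with the hash index.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def load_graph(reads, k, n_bits=10_000, n_hashes=3):
    """Insert every k-mer of every read into a Bloom filter."""
    graph = BloomFilter(n_bits, n_hashes)
    for read in reads:
        for i in range(len(read) - k + 1):
            graph.add(read[i:i + k])
    return graph


def successors(graph, kmer):
    """Edges are implicit: test all four one-base extensions of the k-mer."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in graph]


graph = load_graph(["CCGATTGCACTGGACCGA"], k=10)
print(successors(graph, "CCGATTGCAC"))
```

Shrinking `n_bits` raises the false positive rate, which adds spurious nodes and edges but never removes real ones.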

Page 15: 2015 vancouver-vanbug

Technical challenges met (and defeated)

• Exhaustive in-memory traversal of graphs containing

5-15 billion nodes.

• Sequencing technology introduces false

connections in graph.

• Implementation lets us scale ~20x over other

approaches.

Pell et al., 2012

Page 16: 2015 vancouver-vanbug

Technical challenges met (and defeated)

• Exhaustive in-memory traversal of graphs containing

5-15 billion nodes.

• Sequencing technology introduces false

connections in graph.

• Implementation lets us scale ~20x over other

approaches, but this is not enough.

• Although, see Minia assembler (Chikhi et al.)

Pell et al., 2012

Page 17: 2015 vancouver-vanbug

Second attempt: diginorm

Conway T C , Bromage A J Bioinformatics 2011;27:479-486

© The Author 2011. Published by Oxford University Press. All rights reserved. For Permissions,

please email: [email protected]

Page 18: 2015 vancouver-vanbug

Random sampling => deep sampling

needed

Typically 10-100x needed for robust recovery (30-300 Gbp for human)

Page 19: 2015 vancouver-vanbug

Actual coverage varies widely from the average.

Low coverage introduces unavoidable breaks.
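
A back-of-the-envelope sketch of why deep sampling is needed. It assumes a 3 Gbp genome (consistent with the 30-300 Gbp figure quoted for human) and an idealized Poisson model of uniformly random sampling. The Poisson model is an assumption added here, not something stated in the talk, and real coverage is less uniform than it predicts.

```python
import math


def bases_needed(genome_size_bp, coverage):
    """Total sequenced bases required for a given average coverage."""
    return genome_size_bp * coverage


def fraction_uncovered(coverage):
    """Poisson model: probability that a given base is sampled zero times."""
    return math.exp(-coverage)


HUMAN_GENOME_BP = 3_000_000_000  # assumed approximate size

for coverage in (1, 10, 100):
    total_gbp = bases_needed(HUMAN_GENOME_BP, coverage) / 1e9
    print(f"{coverage:>3}x: {total_gbp:.0f} Gbp sequenced, "
          f"expected uncovered fraction {fraction_uncovered(coverage):.2e}")
```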

Page 20: 2015 vancouver-vanbug

But! Shotgun sequencing is very redundant!

Lots of the high coverage simply isn’t needed.

(unnecessary data)

Page 21: 2015 vancouver-vanbug

Digital normalization

Page 27: 2015 vancouver-vanbug

Contig assembly now scales with underlying genome size

• Transcriptomes, microbial genomes incl MDA,

and most metagenomes can be assembled in

under 50 GB of RAM, with identical or improved

results.

• Memory efficiency is improved by use of a CountMin

Sketch.

Brown et al., 2012, arXiv.
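
A minimal sketch of a CountMin Sketch, to show why it saves memory: counts live in a few fixed-size tables, so memory does not grow with the number of distinct k-mers. Collisions can only inflate a count, never shrink it. The class name, table sizes, and SHA-256 based hashing are illustrative assumptions and do not reflect how khmer implements its counting structure.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: estimates are never below the true count."""

    def __init__(self, width, depth):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One slot per table, each chosen by a differently salted hash.
        for row, table in enumerate(self.tables):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield table, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for table, slot in self._slots(item):
            table[slot] += 1

    def count(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(table[slot] for table, slot in self._slots(item))


sketch = CountMinSketch(width=1024, depth=4)
for kmer in ["CCGATTGCAC", "CCGATTGCAC", "CCGATTGCAC", "CGATTGCACT"]:
    sketch.add(kmer)
print(sketch.count("CCGATTGCAC"), sketch.count("CGATTGCACT"))
```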

Page 28: 2015 vancouver-vanbug

Diginorm is simple:
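
A minimal sketch of the core loop, under simplifying assumptions: exact counts in a `Counter` stand in for the CountMin Sketch, reverse complements are ignored, and the function and parameter names are illustrative rather than khmer's. A read is kept only if the median abundance of its k-mers, among the reads kept so far, is still below the coverage cutoff.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k, cutoff):
    """Yield reads whose estimated coverage is still below the cutoff (single pass)."""
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # Median k-mer abundance serves as the read's coverage estimate.
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads contribute counts
            yield read


# 50 identical reads represent 50x coverage of one locus.
reads = ["CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"] * 50
kept = list(digital_normalization(reads, k=10, cutoff=5))
print(len(kept), "of", len(reads), "reads kept")
```

Because discarded reads never update the counts, the number of stored k-mers tracks the underlying sequence rather than the total amount of data.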

Page 29: 2015 vancouver-vanbug

Diginorm is only a good start:

• Diginorm alters the coverage of the data

set.

• Diginorm also discards lots of data!

• Various other infelicities…

  o Repeats go away!

  o Coverage estimation approach ~poor.

Page 30: 2015 vancouver-vanbug

Diginorm is a good start:

• Diginorm works on genomes,

metagenomes, and transcriptomes;

• Diginorm is streaming and uses

sublinear space.

Page 31: 2015 vancouver-vanbug

Third attempt: a semi-streaming

framework for sequence analysis

https://github.com/ged-lab/2014-streaming/

Page 32: 2015 vancouver-vanbug

Diginorm can detect graph saturation

Zhang et al., submitted.

Page 33: 2015 vancouver-vanbug

This generically permits semi-streaming approaches.

Zhang et al., submitted.

Page 34: 2015 vancouver-vanbug

e.g. E. coli analysis => ~1.2 pass,

sublinear memory

Zhang et al., submitted.

Page 35: 2015 vancouver-vanbug

=> Efficient k-mer error trimming.

Zhang et al., submitted.

(This all works on metagenomes & transcriptomes, too.)
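
A minimal sketch of k-mer error trimming, assuming k-mer counts are already available for a region that has reached sufficient coverage. The read is truncated just before the first base that makes a k-mer rare. Function names and the `min_count` threshold are illustrative assumptions, and exact counts stand in for a probabilistic counting structure.

```python
from collections import Counter


def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    return counts


def trim_at_first_rare_kmer(read, counts, k, min_count):
    """Truncate the read so that it ends before the first low-abundance k-mer."""
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] < min_count:
            # Keep everything up to, but not including, the last base of this k-mer.
            return read[:i + k - 1]
    return read


K = 10
true_seq = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
bad_read = true_seq[:18] + "A" + true_seq[19:]  # one substitution at position 18

counts = count_kmers([true_seq] * 10 + [bad_read], K)
print(trim_at_first_rare_kmer(bad_read, counts, K, min_count=2))
```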

Page 36: 2015 vancouver-vanbug

Moving some sequence analysis to streaming.

~1.2 pass, sublinear memory

Zhang et al., submitted.

(a) two-pass; reduced memory

First pass: digital normalization - reduced set of k-mers.

Second pass: spectral analysis of data with reduced k-mer set.

(b) few-pass; reduced memory

First pass: collection of low-abundance reads + analysis of saturated reads.

Second pass: analysis of collected low-abundance reads.

(c) online; streaming.

First pass: collection of low-abundance reads + analysis of saturated reads.
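
A minimal sketch of the few-pass variant, under stated assumptions: reads that arrive before their region is saturated are counted and set aside, reads that arrive after saturation are analyzed immediately, and only the set-aside reads are revisited. The function names, the `analyze` callback, and the use of exact counts are all illustrative and are not taken from khmer or from Zhang et al.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def few_pass(reads, k, cutoff, analyze):
    """Return (results, passes), where passes is the average number of times a read is seen."""
    counts = Counter()
    set_aside = []  # low-abundance reads, revisited in the second pass
    results = []
    n_reads = 0

    # First pass: count unsaturated reads, analyze saturated ones right away.
    for read in reads:
        n_reads += 1
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            set_aside.append(read)
        else:
            results.append(analyze(read, counts))

    # Second pass: only over the set-aside reads, now that counts are complete.
    for read in set_aside:
        if median(counts[km] for km in kmers(read, k)) >= cutoff:
            results.append(analyze(read, counts))

    passes = (1 + len(set_aside) / n_reads) if n_reads else 0.0
    return results, passes


reads = ["CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"] * 50
results, passes = few_pass(reads, k=10, cutoff=5, analyze=lambda read, counts: len(read))
print(len(results), "reads analyzed in", passes, "passes")
```

The fraction of reads set aside determines how far above one pass the total work lands, which is where a figure such as ~1.2 passes would come from.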

Page 37: 2015 vancouver-vanbug

Sublinear time/space read error analysis

Zhang et al., submitted.

Read error profile from mouse mRNAseq (c.f. Grabherr et al., 2011).

Page 38: 2015 vancouver-vanbug

Another simple algorithm.

Zhang et al., submitted.

Page 39: 2015 vancouver-vanbug

So, that’s pretty cool, right?

• We provide simple time- and memory-efficient approaches for k-mer spectral analysis of large data sets.

• These semi-streaming approaches provide a general framework for applying k-mer spectral approaches to all (deep) sequencing data, including genomes, metagenomes, and RNAseq.

• The khmer software provides a functional and reasonably efficient reference implementation, freely available under the BSD license and actively developed at github.com/ged-lab/.

Page 40: 2015 vancouver-vanbug

Stream all the things! (1/2)

Page 41: 2015 vancouver-vanbug

Stream all the things! (2/2)

Page 42: 2015 vancouver-vanbug

But that’s not all!

Buy now, and you can also get sequence-to-graph

alignment for the low, low price of free!*

graph = khmer.new_counting_hash(…)
aligner = khmer.ReadAligner(graph, trusted=5)
score, graph_align, read_align, is_truncated = \
    aligner.align(seq)

* Terms and conditions may apply. Not all source code fully works :)

Page 43: 2015 vancouver-vanbug

Pair-HMM-based graph alignment

Jordan Fish and Michael Crusoe

Page 44: 2015 vancouver-vanbug

(Full model)

Jordan Fish and Michael Crusoe

Page 45: 2015 vancouver-vanbug

This is a general API…

Many potential uses:

• Error correction;

• Variant calling;

• Counting (to replace mapping) & allelic counts;

• Align to multiple references;

• Tackle strain variation and polyploidy;

• Building consensus graphs from shallow population

sequencing;

• Consensus graph building from multiple read types;

• Protein-guided graph search (BlastGraph & Xander)

Page 47: 2015 vancouver-vanbug

Whole-genome variant calling

Page 48: 2015 vancouver-vanbug

Graphalign is still alpha.

• We don’t understand parameters well.

• Unoptimized.

• Not yet competitive with existing approaches.

• Broadly applicable!

• Hope to engage w/broader community, soon.

Page 49: 2015 vancouver-vanbug

Concluding thoughts #1

• None of our theory is particularly limited to De Bruijn

graphs, although our implementation is deeply tied

to them at the moment.

• We view these ideas (streaming; graphs) as a

potentially substantial improvement over current

mainstream approaches.

• We are not alone – there is a larger community

exploring these approaches! (GA4GH, esp.)

Page 50: 2015 vancouver-vanbug

Concluding thoughts #2

• Our implementations are usable but not yet terribly optimized.

• We are moving khmer towards a platform for providing reference implementations of these approaches, as well as for research and development.

• We are interested in providing components with decent performance & statistical guarantees, for fun and profit.

• Python and C++ FTW!

Page 51: 2015 vancouver-vanbug

Thanks!

Please contact me at [email protected]!

