2016 bergen-sars

Date post: 14-Jan-2017
Upload: ctitusbrown
A 12-step program for biology to survive and thrive in the era of data-intensive science C. Titus Brown Genome Center & Data Science Initiative Mar 18, 2016 Slides are on slideshare.net/c.titus.brown/
Transcript
Page 1: 2016 bergen-sars

A 12-step program for biology to survive and thrive in the era of data-intensive science

C. Titus Brown

Genome Center & Data Science Initiative

Mar 18, 2016

Slides are on slideshare.net/c.titus.brown/

Page 2: 2016 bergen-sars

My path:

Marek’s disease

Soil metagenomics

Ascidian GRNs

Lamprey mRNAseq

Page 3: 2016 bergen-sars

My guiding question

What is going to be happening in the next 5 years with biological data generation?

(And can I make progress on some of the coming problems?)

Page 4: 2016 bergen-sars

DNA sequencing rates continue to grow.

Stephens et al., 2015 - 10.1371/journal.pbio.1002195

Page 5: 2016 bergen-sars

(2015 was a good year)

Page 6: 2016 bergen-sars

Oxford Nanopore sequencing

Slide via Torsten Seemann

Page 7: 2016 bergen-sars

Nanopore technology

Slide via Torsten Seemann

Page 8: 2016 bergen-sars

Scaling up --

Page 9: 2016 bergen-sars

Scaling up --

Page 10: 2016 bergen-sars

Slide via Torsten Seemann

Page 11: 2016 bergen-sars

http://ebola.nextflu.org/

Page 12: 2016 bergen-sars

“Fighting Ebola With a Palm-Sized DNA Sequencer”

See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/

Page 13: 2016 bergen-sars

“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.

Via Elizabeth Kujawinski

Another challenge beyond volume and velocity – variety.

Page 14: 2016 bergen-sars

CRISPR

The challenge with genome editing is fast becoming what to edit rather than how to do it.

Page 15: 2016 bergen-sars

A point for reflection…

Increasingly, the best guide to the next 10 years of biology is science fiction ...

Page 16: 2016 bergen-sars

Digital normalization

Statement of problem:

We can’t run de novo assembly on the transcriptome data sets we have!

Page 17: 2016 bergen-sars

Shotgun sequencing and coverage

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
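
As a rough illustration of this definition, here is a minimal sketch that computes mean coverage from read positions. The read intervals and genome length are made up for the example and are not data from the talk.

```python
def mean_coverage(reads, genome_length):
    """Return the average number of reads overlapping each base.

    `reads` holds (start, end) half-open intervals; values are hypothetical.
    """
    depth = [0] * genome_length          # per-base read depth
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return sum(depth) / genome_length


# Four toy reads on a 100-base genome: 50 + 50 + 50 + 100 aligned bases.
toy_reads = [(0, 50), (25, 75), (50, 100), (0, 100)]
print(mean_coverage(toy_reads, 100))  # 250 aligned bases / 100 bases = 2.5
```

Drawing a vertical line through the reads at one position corresponds to reading a single entry of `depth`; the coverage is the average of those entries.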

Page 18: 2016 bergen-sars

Random sampling => deep sampling needed

Typically 10-100x needed for robust recovery (30-300 Gbp for human)
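
The arithmetic behind those numbers can be sketched as follows. The 3 Gbp human genome size is an assumed round figure, and the Poisson estimate of uncovered bases is an idealization (uniform random sampling, no errors or biases) rather than something stated on the slide.

```python
import math

HUMAN_GENOME_BP = 3e9  # assumption: roughly 3 Gbp


def bases_needed(coverage, genome_bp=HUMAN_GENOME_BP):
    """Total sequenced bases required to reach a given mean coverage."""
    return coverage * genome_bp


def fraction_uncovered(coverage):
    """Poisson approximation: probability that a base has depth zero."""
    return math.exp(-coverage)


for c in (10, 100):
    print(f"{c}x -> {bases_needed(c) / 1e9:.0f} Gbp, "
          f"expected uncovered fraction ~ {fraction_uncovered(c):.1e}")
```

Because sampling is random, some positions fall well below the mean depth, which is why the mean has to be pushed so high to recover everything robustly.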

Page 19: 2016 bergen-sars

Digital normalization

Page 20: 2016 bergen-sars

Digital normalization

Page 21: 2016 bergen-sars

Digital normalization

Page 22: 2016 bergen-sars

Digital normalization

Page 23: 2016 bergen-sars

Digital normalization

Page 24: 2016 bergen-sars

Digital normalization

Page 25: 2016 bergen-sars

(Digital normalization is a computational version of library normalization)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!

Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…
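
The dilution arithmetic can be written out as a small sketch. The function name and its parameters are hypothetical, chosen only to restate the A-versus-B example.

```python
def depth_reached(abundance, rarest_abundance, target_coverage):
    """Coverage one component reaches when the rarest is sequenced to target."""
    return target_coverage * abundance / rarest_abundance


a_depth = depth_reached(abundance=10, rarest_abundance=1, target_coverage=10)
b_depth = depth_reached(abundance=1, rarest_abundance=1, target_coverage=10)
print(a_depth, b_depth)          # A ends up at 100x when B reaches 10x
print(a_depth - b_depth)         # 90x of A is redundant for a 10x target
```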

Page 26: 2016 bergen-sars

Some key points --

• Digital normalization is streaming.

• Digital normalization is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass)

• Currently, primarily used for prefiltering for assembly, but relies on underlying abstraction (De Bruijn graph) that is also used in variant calling.
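
To make the streaming, single-pass idea concrete, here is a minimal sketch of median-based digital normalization. It is not the production implementation: an exact in-memory `Counter` stands in for the fixed-memory probabilistic k-mer counting a real tool would need, reverse complements are ignored, and the `k` and `cutoff` values are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Yield reads whose median k-mer count seen so far is below `cutoff`.

    Each read is inspected once, in arrival order, so this works on a stream.
    """
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue                      # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)     # only kept reads add to the counts
            yield read                    # still informative: keep it
        # otherwise the read is redundant at this coverage and is dropped


# Toy stream: one sequence seen 50 times, another seen once.
stream = ["ACGTTGCA"] * 50 + ["GGATCCAA"]
kept = list(digital_normalize(stream, k=4, cutoff=5))
print(len(stream), "reads in,", len(kept), "reads kept")
```

High-coverage sequence is capped near the cutoff while low-coverage sequence passes through untouched, which is the sense in which the retained data tracks information content rather than data size.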

Page 27: 2016 bergen-sars

Assembly now scales with information content, not data size.

• 10-100 fold decrease in memory requirements

• 10-100 fold speed up in analysis

Page 28: 2016 bergen-sars

Diginorm is widely useful:

1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem (Schwarz et al., 2013; pmid 23985341)

2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem (in prep)

3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al., 2013; pmid 24225886)

Page 29: 2016 bergen-sars

Anecdata: diginorm is used in Illumina long-read sequencing (?)

Page 30: 2016 bergen-sars

Computational problems now scale with information content rather than data set size.

Most samples can be reconstructed via de novo assembly on commodity computers.

Page 31: 2016 bergen-sars

Applying digital normalization in a new project – the horse transcriptome

Tamer Mansour w/Bellone, Finno, Penedo, & Murray labs.

Page 32: 2016 bergen-sars

Input data

Tissue      Library             Length  #samples  #frag (M)  #bp (Gb)
BrainStem   PE fr.firststrand   101     8         166.73     33.68
Cerebellum  PE fr.firststrand   100     24        411.48     82.3
Muscle      PE fr.firststrand   126     12        301.94     76.08
Retina      PE fr.unstranded    81      2         20.3       3.28
SpinalCord  PE fr.firststrand   101     16        403        81.4
Skin        PE fr.unstranded    81      2         18.54      3
            SE fr.unstranded    81      2         16.57      1.34
            SE fr.unstranded    95      3         105.51     10.02
Embryo ICM  PE fr.unstranded    100     3         126.32     25.26
            SE fr.unstranded    100     3         115.21     11.52
Embryo TE   PE fr.unstranded    100     3         129.84     25.96
            SE fr.unstranded    100     3         102.26     10.23
Total                                   81        1917.7     364.07
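
As a reading aid for the table columns, the sketch below checks how the base counts relate to the fragment counts. It assumes (this is an inference, not something stated on the slide) that paired-end (PE) fragments contribute two reads of the listed length and single-end (SE) fragments contribute one.

```python
def expected_gb(fragments_millions, read_length, paired):
    """Approximate gigabases from millions of fragments and read length."""
    reads_per_fragment = 2 if paired else 1   # assumption: PE = 2 reads
    return fragments_millions * read_length * reads_per_fragment / 1000


# BrainStem row: PE, length 101, 166.73 M fragments, listed as 33.68 Gb.
print(round(expected_gb(166.73, 101, paired=True), 2))
# First SE row: length 81, 16.57 M fragments, listed as 1.34 Gb.
print(round(expected_gb(16.57, 81, paired=False), 2))
```

Under that assumption the computed values agree with the listed ones to within rounding.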

Page 33: 2016 bergen-sars

equCabs current status - NCBI Annotation

Tamer Mansour

Page 34: 2016 bergen-sars

[Workflow diagram. Box labels: Library prep; Read trimming; Mapping to ref; Merge rep.; Trans Ass.; Merge by Tiss.; Predict ORF; Variant Ana; Update dbvar; Haplotype ass; Pool/diginorm; Predict ncRNA; Filter & Compare Ass.; filter knowns; Compare to public ann.; Merge All Ass.; Mapping to ref; Trans Ass.]

Tamer Mansour

Page 35: 2016 bergen-sars

Digital normalization & (e.g.) horse transcriptome

The computational demands for Cufflinks

- Read binning (processing time)

- Construction of gene models: number of genes, number of splicing junctions, number of reads per locus, sequencing errors, and complexity of the locus such as gene overlap and multiple isoforms (processing time & memory utilization)

Diginorm

- Significant reduction of binning time

- Relative increase of the resources required for gene model construction with merging more samples and tissues

- ? false recombinant isoforms

Tamer Mansour

Page 36: 2016 bergen-sars

Effect of digital normalization

** Should be very valuable for detection of ncRNA

Tamer Mansour

Page 37: 2016 bergen-sars

The ORF problem

Hestand et al 2014: “we identified 301,829 positions with SNPs or small indels within these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading frame of the transcript and appear to be small errors in the equine reference genome”

Tamer Mansour

Page 38: 2016 bergen-sars

We merged the assemblies into six tissue-specific transcription profiles for cerebellum, brainstem, spinal cord, retina, muscle and skin. The final merger of all assemblies overlaps with 63% and 73% of NCBI and Ensembl loci, respectively, capturing about 72% and 81% of their coding bases. Comparing our assembly to the most recent transcriptome annotation shows ~85% overlapping loci. In addition, at least 40% of our annotated loci represent novel transcripts.

Tamer Mansour

Page 39: 2016 bergen-sars

Diginorm can also process data as it comes in – streaming decision making.

Page 40: 2016 bergen-sars

What do we do when we get new data??

• How do we efficiently process it and update our existing resources?

• How do we evaluate whether or not our prior conclusions need to change or be updated?

– # of genes, & their annotations;

– Differential expression based on new isoforms;

• This is a problem everyone has…and it’s not going away…

Page 41: 2016 bergen-sars

The data challenge in biology

So we can sequence everything – so what?

What does it mean?

How can we do better biology with the data?

How can we understand?

Page 42: 2016 bergen-sars

A 12-step program for biology (??)

(This was a not terribly successful attempt to be entertaining.)

Page 43: 2016 bergen-sars

1. Think repeatability and scaling

What works for one data set,

Doesn’t work as well for three,

And doesn’t work at all for 100.

Page 44: 2016 bergen-sars

2. Think streaming / few-pass analysis

versus

Page 45: 2016 bergen-sars

3. Invest in computational training

Summer NGS workshop (2010-2017)

Page 46: 2016 bergen-sars

4. Move beyond PDFs

This is only part of the story!

Subramanian et al., doi: 10.1128/JVI.01163-13

Page 47: 2016 bergen-sars

5. Focus on a biological question

Generating data for the sake of having data leads you into a data analysis maze – “I’m sure there’s something interesting in there… somewhere.”

Page 48: 2016 bergen-sars

6. Spend more effort on the unknowns!

The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome"

"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect."

Ref.: Pandey et al. (2014), PLoS One 11, e88889. Via Erich Schwarz

Page 49: 2016 bergen-sars

7. Invest in data integration.

Figure 2. Summary of challenges associated with the data integration in the proposed project.

Figure via E. Kujawinski

Page 50: 2016 bergen-sars

8. Split your information into layers

Protein coding >> ncRNA >> ???

** Should be very valuable for detection of ncRNA

*** But what the heck do we do with ncRNA information?

Tamer Mansour

Page 51: 2016 bergen-sars

9. Move to an update model.

Page 52: 2016 bergen-sars

Candidates for additional steps…

• Invest in data sharing and better “reference” infrastructure.

• Build better tools for computationally exploring hypotheses.

• Invest in “unsupervised” analysis of data (machine learning)

• Learn/apply multivariate stats.

• Invest in social media & preprints & “open”

Page 53: 2016 bergen-sars

My future plans?

• Protocols and (distributed) platform for data discovery & sharing.

• Data analysis and integration in marine biogeochemistry & microbial physiology

Page 54: 2016 bergen-sars

Fig. 1: The cycle from data to discovery, over models back to experiment, that generates knowledge as the cycle is repeated. Parts of that cycle are standard in particular disciplines, but putting together the full cycle requires transdisciplinary expertise.

Page 55: 2016 bergen-sars

Training program at UC Davis:

• Regular intensive workshops, half-day or longer.

• Aimed at research practitioners (grad students & more senior); open to all (including outside community).

• Novice (“zero entry”) on up.

• Low cost for students.

• Leverage global training initiatives.

(Google “dib training” for details; join the announce list!)

Page 56: 2016 bergen-sars

Thanks for listening!

Please contact me at [email protected]!

