

A 12-step program for biology to survive and thrive in the era of data-intensive science

C. Titus Brown

Genome Center & Data Science Initiative
Mar 18, 2016

Slides are on slideshare.net/c.titus.brown/

My path: Marek’s disease, soil metagenomics, ascidian GRNs, lamprey mRNAseq

My guiding question: What is going to be happening in the next 5 years with biological data generation?

(And can I make progress on some of the coming problems?)

DNA sequencing rates continue to grow.

Stephens et al., 2015 - 10.1371/journal.pbio.1002195

(2015 was a good year)

Oxford Nanopore sequencing

Slide via Torsten Seemann

Nanopore technology

Slide via Torsten Seemann

Scaling up --


Slide via Torsten Seemann

http://ebola.nextflu.org/

“Fighting Ebola With a Palm-Sized DNA Sequencer”

See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/

“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.

Via Elizabeth Kujawinski

Another challenge beyond volume and velocity – variety.

CRISPR: The challenge with genome editing is fast becoming what to edit rather than how to do it.

A point for reflection…

Increasingly, the best guide to the next 10 years of biology is science fiction ...

Digital normalization

Statement of problem: We can’t run de novo assembly on the transcriptome data sets we have!

Shotgun sequencing and coverage

“Coverage” is simply the average number of reads that overlap each true base in the genome.

Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.

Random sampling => deep sampling needed

Typically 10-100x needed for robust recovery (30-300 Gbp for human)
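To make the arithmetic behind those numbers explicit, here is a minimal sketch (illustrative only; the ~3 Gbp human genome size and the 10-100x targets are the figures quoted above):

```python
# Back-of-the-envelope coverage arithmetic (illustrative only).

def coverage(total_bases, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return total_bases / genome_size

def bases_needed(target_coverage, genome_size):
    """Total bases required to reach a target average coverage."""
    return target_coverage * genome_size

HUMAN_GENOME = 3e9  # ~3 Gbp

for c in (10, 100):
    gbp = bases_needed(c, HUMAN_GENOME) / 1e9
    print(f"{c}x coverage of human requires ~{gbp:.0f} Gbp")
# -> ~30 Gbp and ~300 Gbp, matching the 30-300 Gbp range quoted above.

assert coverage(bases_needed(10, HUMAN_GENOME), HUMAN_GENOME) == 10
```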

Digital normalization


(Digital normalization is a computational version of library normalization.)

Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!

Overkill!!

This 100x will consume disk space and, because of errors, memory.

We can discard it for you…

Some key points:
• Digital normalization is streaming (a minimal sketch follows below).
• Digital normalization is computationally efficient: lower memory than other approaches, parallelizable/multicore, and single-pass.
• Currently used primarily as a prefilter for assembly, but it relies on an underlying abstraction (the De Bruijn graph) that is also used in variant calling.
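To make those points concrete, here is a minimal single-pass sketch of the core diginorm loop. It assumes an exact, dict-based k-mer counter and illustrative values for k and the coverage cutoff; the real khmer implementation uses a probabilistic, fixed-memory counting structure, which is what keeps the memory footprint low.

```python
# Minimal sketch of digital normalization (illustrative, not khmer itself).

K = 20        # k-mer size (assumed for illustration)
CUTOFF = 20   # target coverage C (assumed for illustration)

def kmers(seq, k=K):
    """Yield all k-mers of a read."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def diginorm(reads, cutoff=CUTOFF):
    """Keep a read only if its estimated coverage (median count of its
    k-mers seen so far) is still below the cutoff; otherwise drop it."""
    counts = {}
    for read in reads:
        observed = sorted(counts.get(km, 0) for km in kmers(read))
        median = observed[len(observed) // 2] if observed else 0
        if median < cutoff:
            for km in kmers(read):
                counts[km] = counts.get(km, 0) + 1
            yield read  # streaming: one pass, one decision per read

# usage: kept_reads = list(diginorm(read_iterator))
```

Because reads from already well-covered regions are discarded as they arrive, downstream assembly sees roughly the information content of the sample rather than its raw volume.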

Assembly now scales with information content, not data size.

• 10-100 fold decrease in memory requirements

• 10-100 fold speed up in analysis

Diginorm is widely useful:

1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; PMID 23985341)

2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep)

3. Osedax symbiont metagenome, a “contaminated metagenome” problem. (Goffredi et al., 2013; PMID 24225886)

Anecdata: diginorm is used in Illumina long-read sequencing (?)

Computational problems now scale with information content rather than data set size.

Most samples can be reconstructed via de novo assembly on commodity computers.

Applying digital normalization in a new project – the horse transcriptome

Tamer Mansour w/Bellone, Finno, Penedo, & Murray labs.

Input data

Tissue       Library             Read length   #samples   #frag (M)   #bp (Gb)
BrainStem    PE fr.firststrand   101           8          166.73      33.68
Cerebellum   PE fr.firststrand   100           24         411.48      82.3
Muscle       PE fr.firststrand   126           12         301.94      76.08
Retina       PE fr.unstranded    81            2          20.3        3.28
SpinalCord   PE fr.firststrand   101           16         403         81.4
Skin         PE fr.unstranded    81            2          18.54       3
             SE fr.unstranded    81            2          16.57       1.34
             SE fr.unstranded    95            3          105.51      10.02
Embryo ICM   PE fr.unstranded    100           3          126.32      25.26
             SE fr.unstranded    100           3          115.21      11.52
Embryo TE    PE fr.unstranded    100           3          129.84      25.96
             SE fr.unstranded    100           3          102.26      10.23
Total                                          81         1917.7      364.07

equCabs current status - NCBI Annotation

Tamer Mansour

[Workflow diagram: library prep → read trimming → mapping to ref → transcriptome assembly → merge replicates → merge by tissue → merge all assemblies; pool/diginorm; filter & compare assemblies (filter knowns; compare to public annotations); predict ORFs; predict ncRNA; variant analysis; update dbvar; haplotype assembly.]

Tamer Mansour

Digital normalization & (e.g.) horse transcriptome

The computational demands of Cufflinks:
- Read binning (processing time)
- Construction of gene models: number of genes, number of splice junctions, number of reads per locus, sequencing errors, and locus complexity such as gene overlap and multiple isoforms (processing time & memory utilization)

Diginorm:
- Significant reduction of binning time
- Relative increase in the resources required for gene model construction as more samples and tissues are merged
- Possible false recombinant isoforms (?)

Tamer Mansour

Effect of digital normalization

** Should be very valuable for detection of ncRNA

Tamer Mansour

The ORF problem

Hestand et al 2014: “we identified 301,829 positions with SNPs or small indels within these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading frame of the transcript and appear to be small errors in the equine reference genome”

Tamer Mansour

We merged the assemblies into six tissue-specific transcription profiles for cerebellum, brainstem, spinal cord, retina, muscle and skin. The final merger of all assemblies overlaps with 63% and 73% of NCBI and Ensembl loci, respectively, capturing about 72% and 81% of their coding bases. Comparing our assembly to the most recent transcriptome annotation shows ~85% overlapping loci. In addition, at least 40% of our annotated loci represent novel transcripts.

Tamer Mansour

Diginorm can also process data as it comes in – streaming decision making.

What do we do when we get new data??

• How do we efficiently process, update our existing resources?

• How do we evaluate whether or not our prior conclusions need to change or be updated? (see the sketch below)
– # of genes & their annotations;
– differential expression based on new isoforms;

• This is a problem everyone has…and it’s not going away…
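One hypothetical way to frame that decision in a streaming setting (an illustration, not a method presented in the talk; the k-mer comparison, the novelty() helper, and the 5% threshold are all assumptions) is to ask how much of a new batch of reads is novel relative to the k-mers already represented in the existing assembly, and to trigger re-analysis only when that fraction is large.

```python
# Hypothetical sketch: decide whether new data warrants re-analysis.

K = 20  # assumed k-mer size

def kmers(seq, k=K):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def novelty(new_reads, known_kmers):
    """Fraction of k-mers in the new batch that are absent from the
    k-mer set of the existing assembly/annotation."""
    batch = set()
    for read in new_reads:
        batch |= kmers(read)
    return len(batch - known_kmers) / len(batch) if batch else 0.0

# e.g. re-run assembly and annotation only if novelty(batch, known) > 0.05;
# otherwise just update expression counts against the existing gene models.
```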

The data challenge in biology

So we can sequence everything – so what?

What does it mean?
How can we do better biology with the data?

How can we understand?

A 12-step program for biology (??)

(This was a not terribly successful attempt to be entertaining.)

1. Think repeatability and scaling

What works for one data set,

Doesn’t work as well for three,

And doesn’t work at all for 100.

2. Think streaming / few-pass analysis

versus

3. Invest in computational training
Summer NGS workshop (2010-2017)

4. Move beyond PDFs

This is only part of the story!

Subramanian et al., doi: 10.1128/JVI.01163-13

5. Focus on a biological question
Generating data for the sake of having data leads you into a data analysis maze – “I’m sure there’s something interesting in there… somewhere.”

"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery

being associated with greater research momentum—a genomic bandwagon effect."

Ref.: Pandey et al. (2014), PLoS One 11, e88889. Via Erich Schwarz

The problem of lopsided gene characterization is pervasive:

e.g., the brain "ignorome"

6. Spend more effort on the unknowns!

7. Invest in data integration.

Figure 2. Summary of challenges associated with data integration in the proposed project.

Figure via E. Kujawinski

8. Split your information into layers
Protein coding >> ncRNA >> ???

** Should be very valuable for detection of ncRNA
*** But what the heck do we do with ncRNA information?

Tamer Mansour

9. Move to an update model.

Candidates for additional steps…

• Invest in data sharing and better “reference” infrastructure.

• Build better tools for computationally exploring hypotheses.

• Invest in “unsupervised” analysis of data (machine learning)

• Learn/apply multivariate stats.
• Invest in social media & preprints & “open”

My future plans?

• Protocols and (distributed) platform for data discovery & sharing.

• Data analysis and integration in marine biogeochemistry & microbial physiology

Fig. 1: The cycle from data to discovery, through models and back to experiment, that generates knowledge as the cycle is repeated. Parts of that cycle are standard in particular disciplines, but putting together the full cycle requires transdisciplinary expertise.

Training program at UC Davis:

• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.

(Google “dib training” for details; join the announce list!)

Thanks for listening!

Please contact me at ctbrown@ucdavis.edu!