A 12-step program for biology to survive and thrive in the era of data-intensive science
C. Titus Brown
Genome Center & Data Science Initiative
Mar 18, 2016
Slides are on slideshare.net/c.titus.brown/
My path: Marek’s disease → soil metagenomics → ascidian GRNs → lamprey mRNAseq
My guiding question
What is going to be happening in the next 5 years with biological data generation?
(And can I make progress on some of the coming problems?)
DNA sequencing rates continue to grow.
Stephens et al., 2015 - 10.1371/journal.pbio.1002195
(2015 was a good year)
Oxford Nanopore sequencing
Slide via Torsten Seemann
Nanopore technology
Slide via Torsten Seemann
Scaling up --
Slide via Torsten Seemann
http://ebola.nextflu.org/
“Fighting Ebola With a Palm-Sized DNA Sequencer”
See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.
Via Elizabeth Kujawinski
Another challenge beyond volume and velocity – variety.
CRISPR
The challenge with genome editing is fast becoming what to edit rather than how to edit.
A point for reflection…
Increasingly, the best guide to the next 10 years of biology is science fiction ...
Digital normalization
Statement of problem: We can’t run de novo assembly on the transcriptome data sets we have!
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
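As a quick back-of-envelope in Python (the numbers below are illustrative, not from the talk): coverage is the number of reads times the read length, divided by the genome size, which is how the 30-300 Gbp figure for 10-100x human coverage falls out.

```python
def coverage(n_reads, read_len, genome_size):
    """Average shotgun coverage: total sequenced bases / genome size."""
    return n_reads * read_len / genome_size

# e.g. 1 billion 100 bp reads over the ~3 Gbp human genome:
print(coverage(1_000_000_000, 100, 3_000_000_000))  # ~33x
```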
Digital normalization
(Digital normalization is a computational version of library normalization.)
Suppose you have a dilution factor of A (10) to B (1). To get 10x coverage of B, you need to sequence 100x of A – overkill! That extra 100x of A consumes disk space and, because of errors, memory. We can discard it for you…
Some key points:
• Digital normalization is streaming.
• Digital normalization is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass).
• Currently it is primarily used as a prefilter for assembly, but it relies on an underlying abstraction (the De Bruijn graph) that is also used in variant calling.
Assembly now scales with information content, not data size.
• 10-100 fold decrease in memory requirements
• 10-100 fold speed up in analysis
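The core of the streaming algorithm can be sketched in a few lines of Python. This is a minimal illustration only: an exact dictionary stands in for the memory-efficient probabilistic CountMin sketch that the real implementation (khmer) uses, and the k-mer size and cutoff values are arbitrary.

```python
from collections import defaultdict
from statistics import median

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # target per-locus coverage (illustrative)

counts = defaultdict(int)  # k-mer -> observed abundance

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def keep_read(seq):
    """Single streaming pass: keep a read only if its estimated coverage
    (median abundance of its k-mers so far) is still below CUTOFF."""
    kms = kmers(seq)
    if not kms:
        return False
    if median(counts[km] for km in kms) >= CUTOFF:
        return False          # locus already well covered: discard read
    for km in kms:            # otherwise keep it and count its k-mers
        counts[km] += 1
    return True
```

Because reads are only counted when kept, redundant reads past the coverage cutoff never enter the counting structure – which is why memory scales with information content rather than data size.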
Diginorm is widely useful:
1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep)
3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al, 2013; pmid 24225886)
Anecdata: diginorm is used in Illumina long-read sequencing (?)
Computational problems now scale with information content rather than data set size.
Most samples can be reconstructed via de novo assembly on commodity computers.
Applying digital normalization in a new project – the horse transcriptome
Tamer Mansour w/Bellone, Finno, Penedo, & Murray labs.
Input data:

Tissue       Library             length   #samples   #frag (M)   #bp (Gb)
BrainStem    PE fr.firststrand   101      8          166.73      33.68
Cerebellum   PE fr.firststrand   100      24         411.48      82.3
Muscle       PE fr.firststrand   126      12         301.94      76.08
Retina       PE fr.unstranded    81       2          20.3        3.28
SpinalCord   PE fr.firststrand   101      16         403         81.4
Skin         PE fr.unstranded    81       2          18.54       3
             SE fr.unstranded    81       2          16.57       1.34
             SE fr.unstranded    95       3          105.51      10.02
Embryo ICM   PE fr.unstranded    100      3          126.32      25.26
             SE fr.unstranded    100      3          115.21      11.52
Embryo TE    PE fr.unstranded    100      3          129.84      25.96
             SE fr.unstranded    100      3          102.26      10.23
Total                                     81         1917.7      364.07
equCab2 current status - NCBI Annotation
Tamer Mansour
[Workflow diagram: library prep → read trimming → mapping to ref → transcriptome assembly → merge replicate assemblies → merge by tissue → merge all assemblies → filter & compare assemblies (filter knowns; compare to public annotations) → predict ORF → predict ncRNA → variant analysis → update dbvar → haplotype assembly → pool/diginorm]
Digital normalization & (e.g.) horse transcriptome
The computational demands of Cufflinks:
- Read binning (processing time)
- Construction of gene models: number of genes, number of splice junctions, number of reads per locus, sequencing errors, and locus complexity such as gene overlap and multiple isoforms (processing time & memory utilization)

With diginorm:
- Significant reduction in binning time
- Relative increase in the resources required for gene model construction when merging more samples and tissues
- ? false recombinant isoforms

Tamer Mansour
Effect of digital normalization
** Should be very valuable for detection of ncRNA
Tamer Mansour
The ORF problem
Hestand et al 2014: “we identified 301,829 positions with SNPs or small indels within these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading frame of the transcript and appear to be small errors in the equine reference genome”
Tamer Mansour
We merged the assemblies into six tissue-specific transcription profiles for cerebellum, brainstem, spinal cord, retina, muscle and skin. The final merger of all
assemblies overlaps with 63% and 73% of NCBI and Ensembl loci, respectively, capturing about 72% and 81% of their coding bases. Comparing our assembly to the most recent transcriptome annotation shows ~85% overlapping
loci. In addition, at least 40% of our annotated loci represent novel transcripts.
Tamer Mansour
Diginorm can also process data as it comes in – streaming
decision making.
What do we do when we get new data??
• How do we efficiently process, update our existing resources?
• How do we evaluate whether or not our prior conclusions need to change or be updated?– # of genes, & their annotations;– Differential expression based on new isoforms;
• This is a problem everyone has…and it’s not going away…
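One way such streaming decision-making could look in practice (a hypothetical sketch, not a feature of any particular tool): track the fraction of recent reads that digital normalization keeps, and declare the sample saturated – and stop processing, or even stop sequencing – once almost nothing novel is arriving.

```python
from collections import deque

def saturated(recent_decisions, window=1000, min_keep_rate=0.05):
    """recent_decisions: deque of booleans (was each recent read kept?).
    Saturated = a full window seen and almost no reads were novel."""
    if len(recent_decisions) < window:
        return False   # not enough evidence yet
    return sum(recent_decisions) / len(recent_decisions) < min_keep_rate

# Sketch of use: feed each keep/discard decision into a bounded window,
# and stop streaming once the keep rate collapses.
decisions = deque(maxlen=1000)
```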
The data challenge in biology
So we can sequence everything – so what?
What does it mean?How can we do better biology with the data?
How can we understand?
A 12-step program for biology (??)
(This was a not terribly successful attempt to be entertaining.)
1. Think repeatability and scaling
What works for one data set,
Doesn’t work as well for three,
And doesn’t work at all for 100.
2. Think streaming / few-pass analysis
versus
3. Invest in computational training
Summer NGS workshop (2010-2017)
4. Move beyond PDFs
This is only part of the story!
Subramanian et al., doi: 10.1128/JVI.01163-13
5. Focus on a biological question
Generating data for the sake of having data leads you into a data analysis maze – “I’m sure there’s something interesting in there… somewhere.”
The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome".

"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect."

Ref.: Pandey et al. (2014), PLoS One 11, e88889. Via Erich Schwarz
6. Spend more effort on the unknowns!
7. Invest in data integration.
Figure 2. Summary of challenges associated with the data integration in the proposed project.
Figure via E. Kujawinski
8. Split your information into layers
Protein coding >> ncRNA >> ???
** Should be very valuable for detection of ncRNA
*** But what the heck do we do with ncRNA information?
Tamer Mansour
9. Move to an update model.
Candidates for additional steps…
• Invest in data sharing and better “reference” infrastructure.
• Build better tools for computationally exploring hypotheses.
• Invest in “unsupervised” analysis of data (machine learning)
• Learn/apply multivariate stats.
• Invest in social media & preprints & “open”
My future plans?
• Protocols and (distributed) platform for data discovery & sharing.
• Data analysis and integration in marine biogeochemistry & microbial physiology
Fig. 1: The cycle from data to discovery, through models and back to experiment, that generates knowledge as it is repeated. Parts of that cycle are standard in particular disciplines, but putting together the full cycle requires transdisciplinary expertise.
Training program at UC Davis:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more senior); open to all (including the outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.
(Google “dib training” for details; join the announce list!)