The data flood: We need a bigger boat
James A. Foster
The Initiative for Bioinformatics and Evolutionary Studies
(IBEST)
Biological Sciences, Bioinformatics and Computational
Biology
University of Idaho
JAF INBRE Data Flood 8/4/09
Outline
✦ Where is this flood of data coming from?
✦ What kind of tool is appropriate for this amount of data?
✦ What kind of a tool is “bioinformatics”?
✦ How about an example?
DNA sequencing data flood
Year   bp/day          Technology
1977   7.35            Gels
1986   50-ish          ABI 370
1995   19,000          ABI 377
1998   400,000         ABI 3700
2008   1,600,000,000   454
2009   3,200,000,000   454/FLX
2012   ??              ???
The data flood: DNA example
Year   bp/day          Notes           Water
1977   7.35            Manual: φx174   1L
1986   50-ish          Gel: ABI 370    Barrel (176 gallons)
1995   19,000          Gel: ABI 377    Big pool (2x6x12m)
1998   400,000         Cap: ABI 3700   football field, 20m deep
2008   1,600,000,000   454             Lakes Michigan/Huron
2009   3,200,000,000   454/FLX         all Great Lakes (nearly)
2012   ??              ??              ocean?
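To put a number on how fast the flood is rising, the bp/day figures above can be turned into an average doubling time. The sketch below is illustrative only: it assumes steady exponential growth between the two chosen years, and it treats "50-ish" as exactly 50. The function name `doubling_time_years` is made up for this example.

```python
import math

# Sequencing throughput (base pairs per day), copied from the table above.
# "50-ish" for 1986 is approximated as 50 (an assumption).
throughput_bp_per_day = {
    1977: 7.35,
    1986: 50,
    1995: 19_000,
    1998: 400_000,
    2008: 1_600_000_000,
    2009: 3_200_000_000,
}


def doubling_time_years(year_a, year_b, rates=throughput_bp_per_day):
    """Average doubling time between two years, assuming exponential growth."""
    fold_increase = rates[year_b] / rates[year_a]
    doublings = math.log2(fold_increase)
    return (year_b - year_a) / doublings


fold = throughput_bp_per_day[2009] / throughput_bp_per_day[1977]
print(f"Fold increase, 1977 to 2009: {fold:.3g}")
print(f"Average doubling time: {doubling_time_years(1977, 2009):.2f} years")
```

Under these assumptions, throughput doubled a little more often than once every 14 months on average between 1977 and 2009.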
Bioinformatics tools
Year   Data volume     Technology
1977   1L              spoon
1986   barrel          hose
1995   big pool        pfd
1998   football field  Kayak
2008   Lake Michigan   Orca?
2009   Great Lakes     bigger boat?
2012   ocean?          Glomar?
Bioinformatics: bigger boat?
[Figure: diagram labeled "You", "Your hypothesis", "Data", "The Computer (bioinformatics)", and "Your thesis"]
Reflection on the metaphor
✦ At some point, you can use fundamentally different techniques: spoons versus boats
✦ At some point, you can test fundamentally new hypotheses: not “we need a smaller shark”
✦ Sometimes the old technology is still good: the kayak was appropriate in this picture
✦ The new technology may be for a different purpose: fishing versus deep sea exploration
What does this do?
Not that!
THIS!
A Bigger Boat
Whatever you tell it to do!
What is Bioinformatics?
Bioinformatics is what you tell the computer to do with your data
Of Boats and Bioinformatics
Bioinformatics is what you do with the boat you are in during the data flood
You might be able to do more with a bigger boat
Sampling emergent diversity
✦ Get ALL DNA along an age-variant transect
  • 10 samples per site
  • time since exposure: 5y, 19y, 40y, 63y, 100y, and 150y
  • “chronoclines” sample ecosystems by age
✦ Who’s there?
✦ How does the ecosystem change over time?
Bioinformatics problems
✦Estimate α diversity: number of “species” in each sample and age group
✦Estimate β diversity: amount of variation in “species” between age groups
✦Determine which species (no quotes) are present in each sample (not part of this talk)
Biological questions: How do soil bacteria respond to retreating glaciers? How do microbial soil communities change?
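The two diversity measures can be sketched on toy data. This is only an illustration under stated assumptions: α diversity is taken as observed richness (a plain count, which is a lower bound rather than a statistical estimate), and β diversity is taken as Jaccard dissimilarity between pooled age groups. The talk does not say which estimators were actually used, and the species labels and the names `richness`, `pooled`, and `jaccard_dissimilarity` are invented for this example.

```python
# Hypothetical toy data: each sample is the set of "species" (cluster labels)
# seen in it. Labels and group contents are invented for illustration.
samples_by_age = {
    "5y": [{"sp1", "sp2"}, {"sp1", "sp3"}],
    "150y": [{"sp1", "sp2", "sp4", "sp5"}, {"sp2", "sp4", "sp6"}],
}


def richness(species):
    """Alpha diversity as observed richness: the number of distinct species."""
    return len(species)


def pooled(samples):
    """Union of the species found in all samples of one age group."""
    return set().union(*samples)


def jaccard_dissimilarity(a, b):
    """Beta diversity as 1 - |A & B| / |A | B| (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


young = pooled(samples_by_age["5y"])
old = pooled(samples_by_age["150y"])
print("richness at 5y:", richness(young))
print("richness at 150y:", richness(old))
print("beta diversity, 5y vs 150y:", round(jaccard_dissimilarity(young, old), 3))
```

Presence/absence measures like these ignore abundance. An abundance-weighted measure would need per-species read counts rather than sets.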
Lots of data (post QC)
Age     Samples   Sequences   DNA (Mbp)
5y      9         35,092      8.77
19y     10        41,494      10.37
40y     8         33,665      8.42
63y     9         41,767      10.44
100y    8         41,178      10.29
150y    8         40,210      10.05
Total   52        233,406     58.35
Note: this is a SMALL run. The maximum is 37 GB per 8-hour run, or 1.6 Bbp/day.
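As a quick sanity check on the figures above, the per-age rows can be summed and the mean read length derived from them. This is a minimal sketch that uses only the numbers in the table; the variable names are made up for the example.

```python
# Rows copied from the table above: (age, samples, sequences, DNA in Mbp).
rows = [
    ("5y", 9, 35_092, 8.77),
    ("19y", 10, 41_494, 10.37),
    ("40y", 8, 33_665, 8.42),
    ("63y", 9, 41_767, 10.44),
    ("100y", 8, 41_178, 10.29),
    ("150y", 8, 40_210, 10.05),
]

total_samples = sum(r[1] for r in rows)
total_sequences = sum(r[2] for r in rows)
total_mbp = sum(r[3] for r in rows)

# Mean read length in base pairs, derived from the summed totals.
mean_read_bp = total_mbp * 1_000_000 / total_sequences

print("samples:", total_samples)
print("sequences:", total_sequences)
print("DNA (Mbp):", round(total_mbp, 2))
print("mean read length (bp):", round(mean_read_bp))
```

The sample and sequence totals match the table. The Mbp column sums to 58.34 rather than the printed 58.35, which is consistent with each row having been rounded separately. The mean read comes out at roughly 250 bp.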
Bioinformatics objectives
determine species
cluster by species
cluster by age
Explain data in terms of biological processes and age (tell a story)
Too much data: 233K sequences!
Trick: Turn it upside down
Cluster each of 52 samples (approx. 6k each), choose a proxy sequence
Cluster proxies by age (approx. 40k each)
Cluster combined sequences to get species (quantify richness)
Build +/- matrix
++ + ++
++ - ++
+ - -
+ - +
- +++ +
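The "upside down" idea (cluster small groups first, keep one proxy per cluster, then cluster only the proxies) can be sketched as follows. Several things here are assumptions rather than the method used in the talk: the greedy clustering scheme, the position-by-position identity measure on equal-length reads, the 0.85 threshold, and the toy sequences. For brevity the sketch also goes straight from per-sample proxies to the combined clustering, skipping the intermediate per-age level. The names `identity`, `greedy_cluster`, and `presence_matrix` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned reads."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """Each sequence joins the first proxy it matches at or above `threshold`;
    otherwise it becomes the proxy of a new cluster."""
    clusters = {}  # proxy sequence -> list of member sequences
    for s in seqs:
        for proxy in clusters:
            if identity(s, proxy) >= threshold:
                clusters[proxy].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


def presence_matrix(samples, threshold):
    """Cluster within each sample, cluster the resulting proxies across all
    samples to get "species", then mark each species +/- per sample."""
    proxies = {name: list(greedy_cluster(seqs, threshold))
               for name, seqs in samples.items()}
    all_proxies = [p for ps in proxies.values() for p in ps]
    species = greedy_cluster(all_proxies, threshold)
    return {
        rep: {name: "+" if any(m in proxies[name] for m in members) else "-"
              for name in samples}
        for rep, members in species.items()
    }


# Toy reads, invented for illustration.
samples = {
    "5y": ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"],
    "150y": ["CCCCCCCCCG", "GGGGGGGGGG"],
}
matrix = presence_matrix(samples, threshold=0.85)
for rep, row in matrix.items():
    print(rep, row)
```

The payoff is in the number of comparisons: the second clustering pass sees only one proxy per first-pass cluster instead of every read.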
Bioinformatics challenges
✦ Move data between computers (IGS, laptop, IBEST Core)
✦ File the data in a retrievable way
✦ Associate metadata with data
✦ Cluster sequences within/between samples
✦ Associate clusters with species
✦ Compute diversity statistics
✦ Prepare publications and talks
✦ (much more)
Conclusions
✦ Biology
  • There are thousands of species of bacteria in arctic soil
  • The number of bacterial species increases as the time of post-glacial exposure increases
✦ Algorithmics (want a job?)
  • “Quantity has a quality all its own” (V. I. Lenin)
  • Need new algorithms to use new hardware
  • Database/dataset management is crucial
Thanks!
Ursel Schüette
Zaid Abdo
Jacob Pierson
Larry Forney
Rob Lyon
The Forney-Top lab
John Bunge, Cornell
The Relational Database project, MSU
to INBRE for the excuse
to IBEST for the science
to NIH, NSF, and UI for the money (P20RR16448, P20RR016454, EPS080935)
Metagenomics
✦ Harvest approximately the first 300 bp of every 16S rRNA molecule, all samples
  • Ribosome: required for translation (conserved)
  • Common marker for microbial species
✦ Cluster by evolutionary relationships (“species”)
✦ Analyze by chronocline
Future work: same tune, new lyrics
✦ Data from the human microbiome
  • How do microbial communities vary between healthy and sick people?
✦ Data from polluted soil (Yangtze River, PRC)
  • How do microbial communities vary as pollution increases?
✦ Data from longitudinal transects
  • How does microbial diversity change with latitude?