Page 1 (source: nas-sites.org/builtmicrobiome/files/2016/11/CharlesRobertsonNAS-MOBE-20161201Final.pdf)

MICROBIOME SOFTWARE: END OF BEGINNING.

DR. CHARLES ROBERTSON, DIVISION OF INFECTIOUS DISEASES, UNIVERSITY OF COLORADO SCHOOL OF MEDICINE

DR. DANIEL N. FRANK, DIVISION OF INFECTIOUS DISEASES, SCHOOL OF MEDICINE

DR. J. KIRK HARRIS, DEPT. OF PEDIATRICS, SCHOOL OF MEDICINE & CHILDREN’S HOSPITAL CO

2016-12-01

Page 2

OVERVIEW

[Diagram: Sequence Data → Microbiome Sequence Analysis Tools → Results]

Today: Look at three items in the Black Box

Page 3

OUR MENTOR: NORMAN PACE

• Nucleic acid biochemist

• Extensive ribozyme work: RNase P

• Invented the basis of all microbiome studies: the culture independent method

• Member NAS, election year 1991

NORM’S MENTOR: CARL WOESE

• Nucleic acid biochemist

• Discovered the Archaea

• Put forth the RNA world hypothesis

• Member NAS, election year 1988

Page 4

MY BACKGROUND

• Extensive use of computers starting in 1968

• Itinerant programmer in the US and Europe

• BS Electrical Engineering & Computer Science, 1982 School of Engineering, University of Colorado, Boulder

• 2 years spent doing logic design at a supercomputer company

• 25 years in the Electronic Design Automation business

  • Building commercial software to solve NP complete problems

  • Hardware description languages, Circuit Simulation, Logic Simulation, Placement, Routing, PCBs & ICs

• Last position in the EDA industry: CEO

• PhD, 2008 Molecular, Cellular, and Developmental Biology, University of Colorado, Boulder

• Current: Hardware/Software/Sequence Analysis

• Have processed >200 MiSeq runs in the last 5 years (> 2 billion sequences)

  • ~90% medical

  • ~10% environmental (Customers: primarily CU Boulder School of Engineering)

Page 5

THE MICROBIOME PROCESS

[Diagram: Sample → Extract DNA → Amplify One Molecule → Sequence → Identify & Count Each Sequence Type → Community Composition. The steps are divided into wet bench work and computer informatics; a callout reads “The primary topic of this presentation”.]

Page 6

• The culture independent method

  • Use DNA sequences to identify microbes

• Woese selected the ribosome

• Ribosome: complex machine that assembles proteins from amino acids per information encoded in chromosomes

• A heavily constrained portion of the information processing system

• The ribosome is a ribozyme

• Shape is everything…

  • Precise positioning of reactants and charged ions to get enzymatic activity

Page 7

Small SubUnit rRNA: 16S

Page 8

• Information rich molecule

• Primary sequence

• Secondary Structure

• Information content is non-uniformly distributed across the entire molecule

• The Tree of Life cannot be reproduced with short sequences

• Amplicon access via universal primers

• Desire uniform amplification of all “kinds”

• Ever a compromise between length (cost per sequence) and primer locations

• Easy to identify phylum by short sequence

• No simple/consistent way to get to “species” with short sequences

Small SubUnit rRNA: 16S
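One way to make "information content is non-uniformly distributed" concrete is to measure per-column variability in an alignment. The sketch below computes Shannon entropy for each column of a toy alignment. The sequences are invented for illustration (they are not real 16S data), and entropy is just one possible measure, chosen here as an assumption.

```python
import math
from collections import Counter

# Toy alignment (hypothetical data): rows are sequences, columns are positions.
alignment = [
    "ACGTACGTAA",
    "ACGTTCGTAC",
    "ACGTGCGTAG",
    "ACGTCCGTAT",
]

def column_entropy(column):
    """Shannon entropy (bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# zip(*alignment) walks the alignment column by column.
entropies = [column_entropy(col) for col in zip(*alignment)]
for position, h in enumerate(entropies, start=1):
    print(f"column {position:2d}: {h:.2f} bits")
```

In this toy case only columns 5 and 10 vary, so a short read that misses those columns could not tell the four sequences apart.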

Page 9

• Ordination

The Human Microbiome Project Consortium. 2012. Nature 486:207-214

Page 10

IN THE BEGINNING, WITH SANGER SEQUENCING

• Only use full length sequences for analysis

• The full length sequences had to be ALIGNED (NP complete)

• Why align? To assure comparison of homologous nucleotides

• Build phylogenetic trees (NP complete)

  • Iteratively make informed guesses as to the shapes of trees & measure their probabilities

• Informed trial and error!

G-CGTAATCGAAGGCCATTACGCTTGCGTAATGGCCCGATTACG-C
GCC-TAATCG--GGCCATTACGCTTGCGTAATGGCCCGATTA-GGC
GCCGT-ATCGAAGGCCATTAC-CTTG-GTAATGGCCCGAT-ACGGC
GCCGTAATC---GGC-ATTACGCTTGCGTAAT-GCC-GATTACGGC
GCCGTAATCGAAGGCCATTA-GCTTGC-TAATGGCCCGATTACGGC

[Figure: hairpin secondary structure, drawn as a stem of stacked base pairs (G–C, C–G, A–T, T–A, …) closed by a short loop]
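To illustrate what "align to compare homologous nucleotides" means in practice, here is a minimal sketch of pairwise global alignment (Needleman-Wunsch dynamic programming). The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration. Pairwise alignment like this is polynomial time; it is the simultaneous alignment of many sequences that becomes intractable. The two input sequences are the first two rows of the slide's example with gaps removed.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment by dynamic programming (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]

seq1 = "GCGTAATCGAAGGCCATTACGCTTGCGTAATGGCCCGATTACGC"
seq2 = "GCCTAATCGGGCCATTACGCTTGCGTAATGGCCCGATTAGGC"
top, bottom, total = needleman_wunsch(seq1, seq2)
print(top)
print(bottom)
print("score:", total)
```

The gap placement depends on the scoring scheme, so the output need not match the hand-curated alignment on the slide, which also reflects secondary structure.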

Page 11

[Figure: two phylogenetic trees]

1990, Pre and early Sanger: Woese, et al. PNAS, 1990 June; 87(12):4576-9.

1997, Sanger: Pace, NR. Science, 1997 May 2; 276(5313):734-40.

Page 12

ADVENT OF NEXT GENERATION SEQUENCING

• Induced very rapid change due to very large decrease in price per sequence

  • Sequences/Sample: Sanger/454/MiSeq: 96/8,000/100,000

• Other scientific disciplines suddenly very motivated to explore the microbiomes of their knowledge sub-domains

• Ecologists

• Geologists

• Physicians

• A big problem arose: Alignment & Tree building software of that time did not scale well

  • Existing analysis approaches (computers/software tools) could not cope with the onslaught of the large number of sequences in NGS datasets
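A rough sense of why the tools did not scale: any step that compares every sequence with every other grows quadratically with the number of sequences. The sketch below applies that to the per-sample figures from this slide. Treating all-against-all comparison as the cost model is an assumption made for illustration; real alignment and tree-building costs differ in detail.

```python
def pairwise_comparisons(n):
    """Number of distinct sequence pairs among n sequences (n choose 2)."""
    return n * (n - 1) // 2

# Sequences per sample, as quoted on the slide.
per_sample = {"Sanger": 96, "454": 8_000, "MiSeq": 100_000}

for platform, n in per_sample.items():
    print(f"{platform:>6}: {n:>7,} sequences -> {pairwise_comparisons(n):>13,} pairs")
```

Going from 96 to 100,000 sequences multiplies the sequence count by about a thousand but the pair count by about a million.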

Page 13

SOLUTION: NEW TOOLS… THE PROGRAMMERS ARRIVE

• Adopt new languages and rapid prototyping software creation processes

  • E.g., the Python programming language

• Abandon NP complete processes

  • Vigorously assert all of the following

• Full length sequences not always needed

• Local (or no) alignment good enough

• Just stop building phylogenetic trees (for the most part)

Page 14

ITEM ONE IN THE BLACK BOX: NUMERICAL OTUS

• Per SOPs of Qiime and Mothur: Create numerical OTUs

  • Generate enumerated clusters of sequences that are sort of close (“close enough”, say 3%)

• Pick a single sequence as a representative of each cluster

• Classify only the representative sequence which is then attributed to all sequences in the cluster

• Less classification means faster dataset processing
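The steps above can be sketched as a greedy clustering pass. This is a minimal illustration, not the Qiime or Mothur implementation: it assumes equal-length reads, uses the fraction of mismatched positions as the distance, and takes the first read of each cluster as its representative. The reads are invented toy data.

```python
def mismatch_fraction(a, b):
    """Fraction of differing positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def pick_otus(reads, radius=0.03):
    """Greedy clustering: a read joins the first seed within `radius`, else starts a new OTU."""
    otus = []  # list of (seed_read, member_reads)
    for read in reads:
        for seed, members in otus:
            if mismatch_fraction(seed, read) <= radius:
                members.append(read)
                break
        else:
            otus.append((read, [read]))
    return otus

# Toy 100-nt reads (hypothetical): `near` differs from `base` at 2 positions, `far` at 75.
base = "ACGT" * 25
near = "TT" + base[2:]
far = "T" * 100
reads = [base, near, far, base, far]

otus = pick_otus(reads)
for number, (seed, members) in enumerate(otus, start=1):
    print(f"OTU {number}: {len(members)} reads, representative starts {seed[:8]}...")
print(f"classifier calls needed: {len(otus)} instead of {len(reads)}")
```

Only the representative of each OTU would be classified, which is where the speedup comes from, and also where every member inherits a label it was never directly compared against.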

Page 15

CREATING CLUSTERS

OTU Picking with fixed radius clusters

Numerical OTUs are NOT canonical:

Completely dependent on selection rules: Order & packing heuristic

We don’t have a theoretical framework guided by biochemistry, biology, etc., to inform how the clusters are to be created… everyone is as correct as anyone else, but they are NOT DIRECTLY COMPARABLE.

Intuitive example, that has issues similar to sequence clustering. Let the radius of a circle represent the size of an OTU, eg 3%.

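[Figure: two panels, each with items labeled 1 to 6, illustrating alternative fixed-radius circle packings of the same points]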
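The order dependence can be shown with a one-dimensional analogue of the circle example. In this sketch, points on a line stand in for sequences and the radius stands in for the OTU threshold; the greedy rule (join the first seed within the radius) is an assumed heuristic for illustration, not a specific tool's algorithm.

```python
def greedy_clusters(points, radius):
    """Assign each point to the first seed within `radius`; otherwise it becomes a new seed."""
    clusters = []  # list of (seed, members)
    for p in points:
        for seed, members in clusters:
            if abs(p - seed) <= radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return [sorted(members) for _, members in clusters]

# Same three points, same radius, two processing orders.
print(greedy_clusters([0.0, 2.5, 5.0], radius=3.0))
print(greedy_clusters([2.5, 0.0, 5.0], radius=3.0))
```

The first order yields two clusters and the second yields one, even though the data and the radius are identical.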

Page 16

PICKING ONE REPRESENTATIVE FOR THE CLUSTER

Which dot is the single best representative for this OTU cluster? Why?

Again: no theoretical biochemical/etc. framework to inform the selection of representatives of clusters

Arguments can be made for various approaches, but the arguments are NOT based on biochemistry or biology… they are based on statistics or computer science (which means programmer convenience)
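As a small illustration of how the choice of rule changes the answer, the sketch below applies three plausible rules to one toy cluster: first read seen, most abundant read, and medoid (the read with the smallest total distance to all members). The rule names and the four-letter reads are assumptions made up for this example.

```python
from collections import Counter

def hamming(a, b):
    """Number of differing positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def first_seen(members):
    return members[0]

def most_abundant(members):
    return Counter(members).most_common(1)[0][0]

def medoid(members):
    """Distinct read with the smallest summed distance to every member of the cluster."""
    return min(sorted(set(members)), key=lambda c: sum(hamming(c, m) for m in members))

# One toy cluster (hypothetical reads, duplicates represent abundance).
cluster = ["AAAA", "AATT", "AAAT", "ACAA", "AATT", "AAAT", "ACAA", "AATT"]

print("first seen:   ", first_seen(cluster))
print("most abundant:", most_abundant(cluster))
print("medoid:       ", medoid(cluster))
```

Each rule is defensible on statistical or computational grounds, and each picks a different read here, which is the point the slide makes.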

Page 17

CLUSTERING YIELDS BIASED RESULTS

• Numerical OTUs do a great job of enumerating differences between sets of sequences

  • Great insights via ordination

However:

• Clustering usually superposes a model (3% “species” bins) that does not fit current observations based on the Big Tree

• For medical analyses “different” organisms often appear in a single cluster

• Clustering adds a bias to the results

  • The “representative sequence” does not appropriately match all of the sequences in the clusters

• The true positions of individual sequences become fuzzier

Page 18

ITEM TWO FROM THE BLACK BOX: CLASSIFICATION

• The RDP Classifier: Naïve Bayesian Classification

• Eliminated the need for 2 computationally intensive activities: alignment & tree building

• How does it work?

• Start with unaligned sequence data and associated taxonomy lines (aka, The Training Set)

• Use Bayes’ Theorem to generate probability coefficients that allow very fast classification of “unknown” sequences

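Bayes Theorem: [equation image not captured in this transcript]

[Diagram: Unknown sequence → RDP Classifier (“Probabilistic Binning Cloud”) → Bacteria/Proteobacteria/…/E. coli. A “Norm Pace” label also appears on the slide.]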
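For reference, Bayes' theorem in its standard form, written for this setting. The symbols are assumptions introduced here, since the slide's own rendering is not available: G is a candidate taxon, S is the unknown sequence, and w_i are the short words extracted from S. The second expression is the "naive" step, which treats the words as independent given the taxon.

```latex
P(G \mid S) = \frac{P(S \mid G)\, P(G)}{P(S)},
\qquad
P(S \mid G) \approx \prod_{i} P(w_i \mid G)
```

Because P(S) is the same for every candidate taxon, ranking taxa only needs the numerator, which is why classification reduces to multiplications and additions of precomputed coefficients.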

Page 19

TRAINING SETS: UNALIGNED…

• Makes use of unaligned reference sequences

Classic alignment: retains correlation with the secondary structure!

G-CGTAATCGAAGGCCATTACGCTTGCGTAATGGCCCGATTACG-C
GCC-TAATCG--GGCCATTACGCTTGCGTAATGGCCCGATTA-GGC
GCCGT-ATCGAAGGCCATTAC-CTTG-GTAATGGCCCGAT-ACGGC
GCCGTAATC---GGC-ATTACGCTTGCGTAAT-GCC-GATTACGGC
GCCGTAATCGAAGGCCATTA-GCTTGC-TAATGGCCCGATTACGGC

RDP classifier training set: loses correlation with the secondary structure!

GCGTAATCGAAGGCCATTACGCTTGCGTAATGGCCCGATTACGC..
GCCTAATCGGGCCATTACGCTTGCGTAATGGCCCGATTAGGC....
GCCGTATCGAAGGCCATTACCTTGGTAATGGCCCGATACGGC....
GCCGTAATCGGCATTACGCTTGCGTAATGCCGATTACGGC......
GCCGTAATCGAAGGCCATTAGCTTGCTAATGGCCCGATTACGGC..

Divide into groups of 8 columns: “8-mers”

Using unaligned training sets changes precise boundaries into vague boundaries: Noise.

[Figure: hairpin secondary structure, drawn as a stem of stacked base pairs (G–C, C–G, A–T, T–A, …) closed by a short loop]
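A minimal sketch of naive Bayes classification over 8-mers taken from unaligned sequences follows. It is an illustration of the general technique, not the RDP Classifier's code: the overlapping-window word extraction, the uniform prior, the smoothing constants, the taxon names, and the toy sequences are all assumptions.

```python
import math
from collections import defaultdict

def kmers(seq, k=8):
    """Set of overlapping k-length words in an unaligned sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(training_set, k=8):
    """For each taxon, count how many of its reference sequences contain each word."""
    model = {}
    for taxon, seqs in training_set.items():
        counts = defaultdict(int)
        for seq in seqs:
            for word in kmers(seq, k):
                counts[word] += 1
        model[taxon] = (counts, len(seqs))
    return model

def classify(model, query, k=8):
    """Return (best_taxon, log_scores), assuming a uniform prior over taxa."""
    words = kmers(query, k)
    scores = {}
    for taxon, (counts, n_seqs) in model.items():
        # Simple additive smoothing (an assumption, not RDP's exact formula).
        scores[taxon] = sum(
            math.log((counts.get(word, 0) + 0.5) / (n_seqs + 1)) for word in words
        )
    return max(scores, key=scores.get), scores

# Hypothetical training set: taxonomy label -> unaligned reference sequences.
training_set = {
    "taxon_A": ["ACGTACGTACGTACGTACGT", "ACGTACGTACGAACGTACGT"],
    "taxon_B": ["TTGGCCAATTGGCCAATTGG", "TTGGCCAATTGGCCTATTGG"],
}
model = train(training_set)
best, scores = classify(model, "CGTACGTACGTA")
print(best, scores)
```

Note what the result contains: a label and a score, with no indication of which reference sequence or which nucleotides drove the call.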

Page 20

NAÏVE BAYES CLASSIFICATION: PROS/CONS

• Good

• Very fast (Computer: just multiplications and additions)

• Ubiquitous (Qiime/Mothur/RDP website)

• Bad (from our perspective)

  • Training sets are often unstable… don’t get out what you put in

• Creating a “stable” training set is a black art

• To get the results you want, often have to add/delete apparently completely unrelated sequences

• The result provides no clues whatsoever as to how the classifier came up with the answer

• 100% oracle, 0% insight

• Which sequence in the reference training set was closest to an unknown?

• For very similar sequences, which few nucleotides were different?

• Does not provide an AUDIT TRAIL: critical for clinical medicine & epidemiology!

• Which known species in the database was the basis for an unknown called as that species?
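To show what an audit trail could look like, here is a minimal sketch that reports the closest reference and the exact positions where the unknown differs from it. It assumes the query and references are already aligned to equal length; the reference IDs and sequences are hypothetical, and this is not a description of any existing tool.

```python
def differences(query, reference):
    """1-based positions (with query base, reference base) where two aligned sequences differ."""
    return [
        (position, q, r)
        for position, (q, r) in enumerate(zip(query, reference), start=1)
        if q != r
    ]

def nearest_reference(query, references):
    """Return (reference_id, differences) for the reference with the fewest mismatches."""
    best_id = min(references, key=lambda ref_id: len(differences(query, references[ref_id])))
    return best_id, differences(query, references[best_id])

# Hypothetical aligned reference library and unknown sequence.
references = {"ref_1": "ACGTACGTAC", "ref_2": "ACGTTCGTAA"}
query = "ACGTACGAAC"

ref_id, diffs = nearest_reference(query, references)
print("closest reference:", ref_id)
for position, query_base, ref_base in diffs:
    print(f"  position {position}: query {query_base}, reference {ref_base}")
```

A report like this answers both questions on the slide: which reference was closest, and which few nucleotides were different.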

Page 21

NO UNIVERSAL CONSTANTS IN BIOLOGY…

• Biology is an intrinsically observational activity

• The process: collect and assemble anecdotes

  • Insight arises when a critical mass of anecdotes is accumulated

• No predictive mathematical formulations have been forthcoming, e.g.:

  • No speed of light, No E = mc², Not even an Ohm’s law equivalent

• How many “kinds” of microbes exist on the planet FOR CERTAIN?

• How much sequence distance exists within ALL species level clades in the Big Tree FOR CERTAIN?

• In retrospect 3% species should NOT have been enshrined in microbiome tools

• For “new” organisms: it must always come back to a pairwise comparison

  • As it was for Linnaeus, so it is still for us.

• The “new” organism must be compared to the most similar organism that has already been documented

• The lack of a numeric predictive theoretical framework is at odds with “Bioinformatics”

  • Software “demands” very specific answers to questions like: what does “nearby” mean between two sequences?

Page 22

SO WHAT?

• There are limits to the precision we can get with numerical OTUs and Bayesian classification

• Where in the world are the error bars on these processes?

• Effective software solutions exist that are not based on numerical OTUs & Bayesian classification

• We are at the end of the beginning of microbiome analysis

  • It is time to re-evaluate all of the fundamental assumptions to get to the future

• Next: The biggest “bleeding sore”:

• The libraries of reference sequences we all use.

Page 23

CURATED DATABASES OF FULL LENGTH 16S SEQUENCES

• Why not just NCBI/EMBL?

  • No attempt at all to place sequences in a phylogenetic context.

• Submitted sequences not unambiguously derived from cultivated sources are assigned taxonomy Environmental/Uncultivated

• The two most commonly used curated 16S phylogenetic databases: Greengenes & Silva

• Greengenes

• from the Pace Lab via Phil Hugenholtz to JGI. Qiime default

• Greengenes Database Consortium/2nd Genome: but current status unclear. No updates since May, 2013

• Silva

  • Microbial Genomics Group at the Max Planck Institute for Marine Microbiology, Bremen and the Department of Microbiology at the Technical University Munich. Mothur default

• Well documented releases at somewhat irregular intervals; releases locked to EMBL versions.

• Latest: Silva 128, Sept 28, 2016.

• Silva >>> Greengenes

Page 24

THE PRECISION LIMIT: REFERENCE SEQUENCES

• The most significant microbiome tools limit: database content

• All microbiome tools are vetted against, or make intrinsic use of these curated 16S databases

• How do we know issues exist? Recent availability of many microbial GENOMES

• rRNA’s of microbial genomes are relatively “clean”: uniform, consistent, little variation within

• By comparison with genomes’ rRNAs, many database sequences have non-subtle defects

• Missing pieces, added pieces, perturbed secondary structures

• Source of database sequence defects?

• The mishmash of sequencing technologies over the ages: Sanger, 454, Illumina

• We did not know what we did not know… best efforts at the time…

• Sequence databases are the ultimate Hotel California

• Sequences check into databases… but they never leave.

• Infinite academic collegiality is in force… No non-confrontational means to resolve “issues”.

Page 25

DATABASE CURATION IS HARD, EXPENSIVE, UNDERFUNDED

• Even the “best” rRNA database is inadequate for calls to species level

• These databases are the equivalent of the literature and museums that Linnaeus used to deduce relationships: if we get them wrong, uncertainty propagates.

• Are genomes the silver bullet for high precision reference sequences?

  • Evaluation of rRNAs from genomes in Silva 128 finds some with defects: case-by-case scrutiny required!

  • Defects: Missing pieces, added pieces, perturbed secondary structures, protein content

• Errors likely due to assembly process errors (software as oracle, again!)

  • Most genomicists do not go to extra effort to verify structure of rRNAs (focus is on proteins)

• But: genomes are clearly consistently better…

  • Current databases need to be re-evaluated in the light of the genomes’ rRNA sequences!

• Fundamental career limiting disincentive for database work:

  • The work is considered to be significant but NOT INNOVATIVE… therefore, NO FUNDING

Page 26

REFLECTIONS ON THE JOURNEY…

• Phylogenetic analysis of full length sequences is still the gold standard

• High volume analysis techniques must be characterized in light of phylogenetics

  • Taxonomic error bar characterization needed for microbiome analysis

• Numerical OTUs have provided and will continue to provide utility

• BUT: to maximize biological, biochemical, and evolutionary insight we need the most precise taxonomy calls that can be attained

• Let go of universal numeric constants!

• Reference Sequence Databases

  • New Focus, Means (IEEE style?), and Funding mechanism required… change the quid pro quo for this work!

• Perhaps just a wee bit of rebalance of focus back toward biochemistry instead of software?

Page 27

OUR SOFTWARE SPECIFIC FUNDING

NIH R21HG005964 (Frank)

CIHR Genome Canada (Parkinson)

NIH UH2DK083994-01 (Li)

Page 28

THE END

