Date post: | 19-Dec-2015 |
Category: |
Documents |
View: | 214 times |
Download: | 0 times |
Pick a bug, any bug.Pick a bug, any bug.
The design and practical use of rDNA oligonucleotide microarrays to identify microbes in complex samples
Todd DeSantis, Igor Dubosarskiy (Ceres), Sonya Murray, Gary Andersen
Environmental Molecular Microbiology Group
BBRP - LLNL
Comprehensive Aligned Sequence Construction for Automated Design of Effective Probes (CASCADE-P)
The worries of an The worries of an ItalianItalian parentparent
Is he getting enough sleep?
Does he feel accepted?
Is there sufficient diversity in his
lower G.I. bacterial
community?
Are the straps on his car seat irritating his
neck?
Are there archaeal organisms are in the aerosols he
breathes? Enzo Salvatore DeSantis
Jenny DiGiovanni DeSantis
• Every soiled diaper sacrificed to the Diaper Genie is lost data.
• But who wants to do all the work?– Culture
• anaerobes
• non-cultivable
– Sequencing 16S rDNA• Need to create, clone, & process hundreds of samples
• Can we create a simple, quantitative, comprehensive microbial test?
OutlineOutline
• Goals
• Experimental approach
• Why create a new 16S rDNA database?
• How do you align >65,000 sequences?
• Organization of sequences into types
• Designing probes for each type
• Reassessing probe specificity as database grows
• Using 16S GeneChip for quantitative aerosol analysis
Project OverviewProject Overview
• Create a single GeneChip® capable of detecting and quantifying bacterial and/or archaeal organisms in a complex sample.
• What is in a sample, as opposed to what is not in a sample.
• Approach– Combinatorial power of multiple probes for sequence-
specific hybridization
General ProtocolGeneral ProtocolSample
Air
Soil
Feces
Blood
rRNA
gDNA
Universal 16S rDNA PCR
Random hexamer total gDNA amplification
Allen Christian
Contains probes adhered to glass surface in grid pattern.
16S rRNA gene (16S rDNA)16S rRNA gene (16S rDNA)
• Used to identify and classify organisms by gene sequence variations.
• Variations have been used in design of DNA probes for the detection of: – taxonomic domains, divisions, groups …– specific organisms
The The RibosomeRibosome
• Folded secondary structure
• Essential functional component
• Conserved spans– structure must be retained for viability
– targeted for universal/group-specific PCR primers and probes
• Variable regions– spans not fundamental to the folded structure
– receive less pressure from natural selection
– probed for genus and species level discrimination
Building upon two Building upon two decades of 16S gene decades of 16S gene
catalogingcataloging• Over 75,000 16S records housed at
NCBI.• Grows every week.• Don’t need to build a reference
library for organism ID.– FAME– MS– Surface Antigen Identification
What could be What could be amplified?amplified?
• Universal 16S PCR primers complex population of amplicons.
• Must define the targets to consider as the Potential Amplicon Set or PAS.
• “Give me a file of all the amplicon sequences possible if we used universal primers upon a sample containing every prokaryote.”
Variable
Tom Kusmarski
Difficulties defining the PASDifficulties defining the PAS
• Why Entrez search for “16S” is insufficient:– non-16S sequences are errantly deposited as 16S – 16S sequences may be concealed within longer
records that cover entire operons or genomes – anti-sense strands aren't specified as such
Difficulties defining the PASDifficulties defining the PAS
• For each sequence, need to trim away bases that won’t be amplified.
• Problem: Most 16S records are “partial”.
• Can primer pattern matching within sequences allow for proper trimming?– What does it mean when a primer search fails?
• Primer locus present in record but mutated, or
• Primer locus outside the sequence span deposited
Difficulties defining the Difficulties defining the PASPAS
• Aligned sequences were necessary.
• A 16S MSA arranged as horizontal rows of characters allows vertical slices to be extracted between columns of primer annealing positions.
Existing 16S aligned Existing 16S aligned databasesdatabases
• Under 20,000 sequences among 3 databases.– Updates occurred annually or worse.– Others focused on hand-aligning complete sequences.
• Structure predictions• Phylogeny assessment
• Comprehensive Aligned Sequence Construction for Automated Design of Effective Probes (CASCADE-P)– need up-to-date records– need to include partial sequences
• add “inertia” to region considered conserved• increase the likelihood of detecting a polymorphism • searched for unwanted cross-hybridizations with a tentative probe
2.30.9.2.10
5th Level:C.ACETOBUTYLICUM_SUBGROUP
4th Level:C.BOTULINUM_GROUP
3rd Level:CLOSTRIDIUM_AND_RELATIVES
2nd Level:GRAM_POSITIVE_BACTERIA
1st Level:BACTERIA
Clostridium collagenovorans DSM 3089 (T) Clostridium sardiniensis ATCC 33455 (T) Clostridium acetobutylicum ATCC 824 (T) Clostridium acetobutylicum DSM 792 (T) Clostridium acetobutylicum ATCC 824 (T) Clostridium acetobutylicum NCDO 1712 Clostridium acetobutylicum DSM 1731
2.28.3.27.2
5th Level:ESCHERICHIA_SUBGROUP
4th Level:ENTERICS_AND_RELATIVES
(Group)
3rd Level:GAMMA_SUBDIVISION
2nd Level:PROTEOBACTERIA
1st Level:BACTERIA
U85138 clone ACK-SA7AE000452 Escherichia coli str. K-12Er.trachep Erwinia tracheiphila LMG 2906 (T)E.coliK12 Escherichia coli [gene=rrnG gene]Haf.alvei3 Hafnia alveiS.tymuriu3 Salmonella typhimurium str. Stm1Shi.boydii Shigella boydiiAF084835 str. KN4S.enterit4 Salmonella enteritidis str. SE22S.ptyphi6 Salmonella paratyphiS.typhi3 Salmonella typhi str. St111S.bovismrb Salmonella bovis morbificans Sbm1Alt.agrlyt Alterococcus agarolyticus str. ADT3Shi.flxne2 Shigella flexneri ATCC 29903 (T)
RDP alignment and tree RDP alignment and tree – a great skeleton– a great skeleton
• Ribosomal Database Project (Michigan State)
• 16S MSA of 16 277 seqs (v8.1)
• trimmed of extra-16S
• top-strand oriented
• 1,541 bases stretched to 4,218 characters
• Each placed within a hierarchical phylogenetic tree
Sequence pre-Sequence pre-processingprocessing
• Download ‘16S’ candidates via “ESearch”
• BLAST compare to 16S/18S RDP “standards”.– Candidates were rejected if
• the longest match length was <300 base pairs
• the highest scoring BLAST subject was eukaryotic
• candidate matched sequences in two or more RDP terminal tree branches equally well
– Phylocode assigned from top HSP
Sequence pre-Sequence pre-processingprocessing
• Candidate trimmed of extra-16S seq data– tRNA genes, intergenic
spacer regions, and 23S rDNA
– based on HSP boundries
• If HSP paired opposite strands, candidate was reverse complemented.
Sequence pre-Sequence pre-processingprocessing
• The "template" was assigned from the top HSP from a second BLAST process – G=1, E=1.
– Favors longer, but less identical matches.
Prokaryotic Multiple Sequence AlignmentProkaryotic Multiple Sequence Alignment prokMSAprokMSA
• Essentially, the prokMSA was a merger built serially by aligning each candidate to its closest relative in the RDP tree.
• Two Steps– Align0
• Publicly available, pair-wise SW aligner• Candidate expansion
– NAST • Nearest Alignment Space Termination• Novel algorithm• Candidate compression
NASTNASTDEFINE
St = post-Align0 template sequence.
Sc = post-Align0 candidate sequence.
Ht = alignment space (hyphen) inserted into
St by Align0.
Hc = alignment space (hyphen) inserted into
Sc by Align0.
WHILE (St contains one or more Ht) DO
LHt = character index of distal 5' Ht
within St
L5' = character index of Hc within Sc
which is 5' proximal to Ht
L3' = character index of Hc within Sc
which is 3' proximal to Ht
IF ((LHt – L5') > (L3' – LHt)) Delete Hc
found at L3'
ELSE Delete Hc found at L5'
Delete template gap character.
END WHILE
Operational Taxonomic Operational Taxonomic UnitsUnits
• We did not desire to design probes for each sequence.– Many sequences are nearly identical.– Desired 20 probe per sequence:
• 60,000 seqs * ( 20 probes + 20 probes) = • 2.4 million probes (not possible, yet)
• We did desire to design probes for each type of sequence.
• Need to group sequences into types amenable to probe design.
Operational Taxonomic Operational Taxonomic UnitsUnits
• Avoid groupings based on historical nomenclature.• Sequence-dependent classification by transitive
similarity clustering at 98%.
• Create groupings into Operational Taxonomic Units (OTU).
• Each sequence must be in exactly 1 OTU
if x R y & y R z x R z
BACTERIA (2) GRAM_POSITIVE_BACTERIA (2.30) BACILLUS-LACTO-STREPTOCOC_SUBDIVISION (2.30.7) CARNOBACTERIUM_GROUP (2.30.7.18) CRN.DIVERGENS_SUBGROUP (2.30.7.18.2) OTU 2.30.7.18.2.012 (11 sequence records) AF244371 Nostocoida limicola I Ben200 AF244372 Nostocoida limicola I Ben201 AF244375 Nostocoida limicola I Ben77 AF255736 Nostocoida limicola I AF276462 Uncultured bacterium clone RFLP 102E AF394926 Lactosphaera sp. PMagG1 AJ296179 Ruminococcus palustris DSM 9172T AJ306612 Trichococcus collinsii 37AN3 L76599 Lactosphaera pasteurii ATCC 35945 X87150 Lactosphaera pasteurii DSM 2381 Y17301 Trichococcus flocculiformis DSM 2094
L765
99
X871
50
Y173
01
AF27
6462
AF24
4371
AF39
4926
AF25
5736
AJ29
6197
AF24
4375
AF24
4372
AJ30
6612 Identity
Score Matrix
1.000 0.991 0.987 0.954 0.963 0.992 0.991 0.986 0.991 0.977 0.992 L765991.000 0.991 0.957 0.995 0.999 0.995 0.991 0.996 0.994 0.999 X87150
1.000 0.952 0.991 0.989 0.990 0.983 0.990 0.989 0.989 Y173011.000 0.956 0.957 0.956 0.956 0.990 0.955 0.957 AF276462
1.000 0.994 0.995 0.992 0.994 0.993 0.994 AF2443711.000 0.995 0.993 0.996 0.993 1.000 AF394926
1.000 0.988 0.994 0.994 0.994 AF2557361.000 0.996 0.991 0.989 AJ296197
1.000 0.993 0.996 AF2443751.000 0.993 AF244372
1.000 AJ306612
Sequences ClusteredSequences Clustered
Is = IB Lm / min (La, Lb)
Distribution of Sequences in OTUs
1
10
100
1000
10000
1 2 3 4 56
to 1
011
to 2
021
to 5
051
to 1
0010
1 to
200
over
200
Number of Sequences per OTU
Co
un
t o
f O
TU
s
Some OTUs contain hundreds of sequences.
Example: Many isolates of a human pathogen.
Some “species” are found in over 20 OTUs.
•Bioinformatics manuscript input
•Sonya Murray
•Peter Agron
•Sadhana Chauhan
http://greengenes.llnl.gov/http://greengenes.llnl.gov/16S16S
• Comprehensive Aligned Sequence Construction for Automated Design of Effective Probes
• Igor Dubosarskiy– Java
implementations
• Tim Harsch– RDBMS
consultations
• Lisa Corsetti– Apache module
management
• Kevin Melissare– Graphics
““My Interest List”My Interest List”
• Able to define a region of the tree that is important to you.
• Your list will be remembered between visits.
Picking Probes for GeneChip MicroarrayPicking Probes for GeneChip Microarray
27 1507
pA 1492R
• Select 20 to 28 probe pairs for each of 8,432 OTUs• Ideal Perfect Match Probe
• 25mer • Present in all sequences of the OTU• Not present outside the OTU• Unable to X-hybe with seqs in other OTUs
• Ideal Mis-match Control Probe• Unable to X-hybe within entire PAS
Slice taken from prokMSA
Scoring Probe Scoring Probe CandidatesCandidates
• Score is calculated for each potential probe pair.
• Product of 3 factors:– Locus Specific Prevalence Factor– PerfectMatch X-hybe Factor
• The closer the tree distance, the lower the factor
– MisMatch X-hybe Factor • 3 bases available
• use base which produces highest factor
Cross HybridizationCross Hybridization
Central Data Structure:$17mer_hash{AGCTATTATAGCTGCAG}{‘2.30.7.12.4.004’} = 1
$17mer_hash{AGCTATTATAGCTGCAG}{‘2.30.7.12.4.009’} = 1
$17mer_hash{AGCTATTATAGCTGCAG}{‘2.30.7.12.3.001’} = 1
…
1.2 Gb
Allowed rapid lookup of all OTUs containing a particular 17mer
Consultations:
Mark Wagner
Tom Slezak
Tom Kuczmarski
Mike Mittman, Affymetrix
Combinatorial Combinatorial scoring of “Probe scoring of “Probe Sets” are able to Sets” are able to categorize mixed categorize mixed
samples.samples.
OTU % pos pairs2.30.7.12.1.013* 1002.30.7.12.1.014 46 – 572.30.7.12.1.015 54 - 612.30.7.12.1.016 39 – 542.30.7.12.1.017 182.30.7.12.2.002 112.30.7.12.2.003 142.30.7.12.2.005 14 – 322.30.7.12.2.006 18 – 322.30.7.12.2.007 21 – 252.30.7.12.2.008 14 – 292.30.7.12.3.001 7 – 252.30.7.12.3.002 82.30.7.12.3.003 42.30.7.12.3.004 7 – 112.30.7.12.3.005 4 – 142.30.7.12.3.006 112.30.7.12.3.007 14 – 292.30.7.12.3.008 72.30.7.12.3.009 4 – 112.30.7.12.3.010 0 - 42.30.7.12.4.001 21 – 362.30.7.12.4.004* 100
2.30.7.12.4.005 0 – 112.30.7.12.4.006 29 – 542.30.7.12.4.007 11 – 142.30.7.12.4.008 11
S. aureus spike
B. anthracis spike
Art Koybayashi – Simulation package
OTU % pos pairs2.30.7.12.1.013* 1002.30.7.12.1.014 46 – 572.30.7.12.1.015 54 - 612.30.7.12.1.016 39 – 542.30.7.12.1.017 182.30.7.12.2.002 112.30.7.12.2.003 142.30.7.12.2.005 14 – 322.30.7.12.2.006 18 – 322.30.7.12.2.007 21 – 252.30.7.12.2.008 14 – 292.30.7.12.3.001 7 – 252.30.7.12.3.002 82.30.7.12.3.003 42.30.7.12.3.004 7 – 112.30.7.12.3.005 4 – 142.30.7.12.3.006 112.30.7.12.3.007 14 – 292.30.7.12.3.008 72.30.7.12.3.009 4 – 112.30.7.12.3.010 0 - 42.30.7.12.4.001 21 – 362.30.7.12.4.004* 100
2.30.7.12.4.005 0 – 112.30.7.12.4.006 29 – 542.30.7.12.4.007 11 – 142.30.7.12.4.008 11 Percent of probe-pairs scored positive for each probe set in the Staphylococcus Group.
Hybridization results from spike-in experiment done in
triplicate.
Sonya Murray
Aubree Hubbel
Combinatorial Combinatorial scoring of “Probe scoring of “Probe Sets” are able to Sets” are able to categorize mixed categorize mixed
samples.samples.
PAS is a moving targetPAS is a moving target• Problems
– New 16S sequence data is constantly being deposited to public databases.
– Mismatch Probes can become Perfect Matches.– Phylogenetic groupings (OTUs) can change.
• New transitive links may be discovered
Opportunities
Name=2.30.7.12.004
BlockNumber=1
NumAtoms=33
NumCells=66
CellHeader=X Y PROBE FEAT QUAL
Cell1=6 59 TCAAACATTGCGGGCTTCAG 2.30.7.12.004
Cell2=6 58 TCAAACATTGTGGGCTTCAG 2.30.7.12.004
Cell3=171 202 AAACATTGTGGGCTTCAGCC 2.30.7.12.004
Cell4=171 203 AAACATTGTGTGCTTCAGCC 2.30.7.12.004
Cell5=197 163 CATTGTGGGCATCAGCCACC 2.30.7.12.004
Cell6=197 162 CATTGTGGGCTTCAGCCACC 2.30.7.12.004
Cell7=151 175 ATTGTGGGCTGCAGCCACCC 2.30.7.12.004
Cell8=151 174 ATTGTGGGCTTCAGCCACCC 2.30.7.12.004
Cell9=228 2 TTGTGGGCTTCAGCCACCCC 2.30.7.12.004
Cell10=228 3 TTGTGGGCTTTAGCCACCCC 2.30.7.12.004
Cell11=139 22 TGTGGGCTTCAGCCACCCCA 2.30.7.12.004
Cell12=139 23 TGTGGGCTTCGGCCACCCCA 2.30.7.12.004
Cell13=94 76 GGGCTTCAGCCACCCCATTG 2.30.7.12.004
Cell14=94 77 GGGCTTCAGCTACCCCATTG 2.30.7.12.004
Cell15=111 118 CTTCAGCCACCCCATTGGAA 2.30.7.12.004
…
…
…
…
…
…
DynamicDynamic probe probe associationsassociations
• Ability to “re-map” chip to up-to-date 16S data
• Many of the existing probes on the physical array are complementary to “new” sequences.
• Probes originally deemed MM can become PM to some organisms.
• mySQL database used for association maintenance
original (static) dynamic
Finding groupingsFinding groupingsseq
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
sequ
ence
s
probes
Consider A – O to be 16S sequences.
Consider 1 – 24 to be probes already embedded on the chip.
First, associate all available probes with all available sequences.
Let probe similarities drive sequence groupings.
Finding groupingsFinding groupingsseq
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
Consider A – O to be 16S sequences.
Consider 1 – 24 to be probes already embedded on the chip.
First, associate all available probes with all available sequences.
Let probe similarities drive sequence groupings.
Finding groupingsFinding groupingsseq
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
Consider A – O to be 16S sequences.
Consider 1 – 24 to be probes already embedded on the chip.
First, associate all available probes with all available sequences.
Let probe similarities drive sequence groupings.
Finding groupingsFinding groupingsseq
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
Consider A – O to be 16S sequences.
Consider 1 – 24 to be probes already embedded on the chip.
First, associate all available probes with all available sequences.
Let probe similarities drive sequence groupings.
Finding groupingsFinding groupingsseq
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
Consider A – O to be 16S sequences.
Consider 1 – 24 to be probes already embedded on the chip.
First, associate all available probes with all available sequences.
Let probe similarities drive sequence groupings.
Progressive Cyclical Progressive Cyclical GroupingGrouping
DEFINE uGBpp as the number of useful probe pairs which globally differentiate a cluster from all other sequences.
FOR uGBpplock (11 .. 4) DO
FOR uPWppsep (1 .. 10) DO
Determine uGBppclust for each cluster.
Lock all clusters where uGBppclust ≥ uGBpplock .
Pair-wise (PW) compare each non-locked cluster (clustA, clustB)
uPWppclustA = number of useful probe pairs which PW differentiate clustA from clustB
uPWppclustB = number of useful probe pairs which PW differentiate clustB from clustA
Merge sequences of clustA and clustB into one cluster unless uPWppclustA ≥ uPWppsep AND uPWppclustB ≥ uPWppsep
END FOR
Separate all sequences in non-locked clusters so that each sequence is the sole element of its own cluster.
END FOR Count of Solved Clusters ith each Cycle's Parameters
1
10
100
1000
10000
1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77
Cycle
Co
un
tTotal Clusters
Solved Clusters
uGBpplock
uPWppsep
Quantitative AnalysisQuantitative Analysis
• Could the concentration of each amplicon in a sample be measured by fluorescence intensity?
• Experimental setup for 20 point calibration:
Experiment Lc.oenos Fer.nod Sap.grand M.neuro H20 16S amplicons*
1 5 13 31 74 No Yes
2 13 31 74 143 No Yes
3 31 74 143 5 No Yes
4 74 143 5 13 No Yes
5 143 5 13 31 No Yes
6 0 0 0 0 Yes Yes
* 18uL of products from 30 cycle universal 16S PCR of gDNA extracted from U.K. air sample.
SPIKE CONCENTRATION (pM in Hybridization Solution)SPIKE CONCENTRATION (pM in Hybridization Solution)
Sonya Murray
Carol Stone
Quantitative AnalysisQuantitative Analysis
Calibration Graph
y = 0.8544x + 10.906R = 0.9294
5
7
9
11
13
15
17
19
0 1 2 3 4 5 6 7 8
log2 concentration
log2
Ave
rage
Diff
eren
ce
spike-ins
Confidence Limits
Environmental Amplicons
Linear Regression
Log2 transformed
Linear Least Squares Regression
Pearson’s corr coeff was significant (df=18)
95% confidence intervals calculated according to: National Measurement System Valid Analytical Measurement Programme (VAM)
Quantitative AnalysisQuantitative Analysis
• Environmental community is measured with confidence intervals.
Concentration of Environmental SSU Amplicons detected w ith SSU chip. Error bars indicate 95% confidence level.
0 25 50 75 100 125 150 175
2.30.9.2.5sid5429
2.30.7.21.6sid5217
2.30.7.9.2sid3596
2.28.3.5sid2182
2.28.3.99sid2099
2.28.1.6.99sid10172
2.30.7.12.1sid10148
2.30.9.2.99sid10459
3.9.1.4.4.1sid8086
2.28.1.4.1sid2243
2.28.3.5sid3966
2.30.1.10.6sid6385
2.30.1.8.1.9sid9990
2.30.7.12.1sid10221
3.9.2.1.1sid8152
3.9.1.4.8.1sid7423
pro
be
se
ts
Concentration (pM)
SummarySummary
• prokMSA contains 65,000 aligned sequences and growing (largest collection).
• Over 8,000 distinct OTUs have been found.
• Global probe-picking was completed.
• “DOE 16S” GeneChip® was manufactured.
• Ability to correctly categorize spike-ins is being validated.
• Detected amplicons can be quantified.
AcknowledgementsAcknowledgements• Gary Andersen – Group Leader (The Tangent Terminator)• Carol Stone – Sample collection, hybridization (DSLT)• Aubree Hubbel – Spike synthesis• Sonya Murray - Hybridizations• Peter Agron – ms advise• Sadhana Chauhan – ms advice• Mike Mittman – probe selection constraints (Affymetrix)• Art Koybayashi – Hyb simulation package, ms advice• Tom Kusmarski - PAS, algorithm optimization• Tom Slezak - algorithm optimization• Mark Wagner - algorithm optimization• Igor Dubosarskiy – Java, web front-end (Ceres)• Tim Harsch - RDBMS• Lisa Corsetti – Apache administration• Allen Christensen – genomic sample amplification• This work was performed under the auspices of the U.S. Department of Energy by the University
of California, Lawrence Livermore National Laboratory, under contract no. W-7405-Eng-48.