Rapid quantification and taxonomic classification of a complex consortium of rDNA amplicons from both prokaryotic and eukaryotic origins using a microarray.
CEB - ESD - LBNL
Todd DeSantis, Sonya Murray, Jordan Moberg, Gary Andersen
Carol Stone (DSTL, U.K.)
What bugs are in my sample?
The ponderings of a toddler
Why must Mom confiscate my “Hello Kitty” blanket on laundry day?
Will the swings be wet at the park?
How will this sausage impact the diversity in my lower G.I. bacterial community?
Will I inhale any archaeal microorganisms when I visit the hot springs?
Gianna DeSantis
• Every discarded water sample, geological core, or spent air filter is lost data.
• But who wants to do all the work?
– Culture? Anaerobes? Non-cultivable? Safety?
– Analysis of nucleic acids isolated from environment
• Must classify or sort heterogeneous nucleic acids into bins.
– Restriction Fragment Length Polymorphisms (RFLP)
– Single Stranded Conformation Polymorphisms (SSCP)
– Temp/Denat Gradient Gel Electrophoresis (T/DGGE)
– Sequencing
» Provides taxonomic nomenclature
» Estimates the relative abundance
» Need to create, clone, & process hundreds of samples
• Can we create a simple, quantitative, comprehensive microbial test?
Outline
• Goals
• Experimental Approach
• Organization of rDNA sequences into taxa (CASCADE-P)
• Assigning sets of probes for each taxon
• Using 16S GeneChip for quantitative aerosol analysis
Project Overview
• Goal
– Create a single microarray capable of detecting and quantifying bacterial and/or archaeal organisms in a complex sample.
• Approach
– Combinatorial power of multiple probes for sequence-specific hybridization
16S rRNA gene (16S rDNA)
• Used to identify and classify organisms by gene sequence variations.
• Variations have been used in design of DNA probes for the detection of:
– taxonomic domains, divisions, groups …
– specific organisms
The Ribosome
[Figure: the ribosome. Labels: rDNA; rRNA (functional molecule); LSU; SSU (16S or 18S)]
• Folded secondary structure
• Essential functional component
• Conserved spans
– structure must be retained for viability
– targeted for universal/group-specific PCR primers and probes
• Variable regions
– spans not fundamental to the folded structure
– receive less pressure from natural selection
– probed for genus and species level discrimination
What could be amplified?
• Universal 16S PCR primers yield a complex population of amplicons.
• Must define the targets to consider as the Potential Amplicon Set.
[Figure: SSU rDNA drawn 5’ to 3’. Labels: Variable; 1390; 1507; Region interrogated on chip; pA; Ccomp; 1492R]
• First generation rDNA Array uses 85-base highly variable region of ribosomal DNA
• 20 base DNA signature segments on chip = probe set
• Sample reacts only with complementary signature sequences on chip
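As a rough illustration of how an 85-base region could supply 20-base signature segments, the sketch below slides a 20-base window across a region one base at a time. The one-base step, the function name `tile_probes`, and the placeholder sequence are assumptions for illustration; the slides do not say how the windows were chosen.

```python
def tile_probes(region: str, probe_len: int = 20) -> list[str]:
    """Return every probe_len-base window of region, stepping one base at a time."""
    if len(region) < probe_len:
        return []
    return [region[i:i + probe_len] for i in range(len(region) - probe_len + 1)]


# Placeholder 85-base sequence (not real 16S data), built only to show the count.
region = ("ACGT" * 22)[:85]
probes = tile_probes(region)
print(len(region), len(probes))
```

Under these assumptions an 85-base region holds 85 - 20 + 1 = 66 overlapping 20-base windows.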
http://greengenes.llnl.gov/16S
• Comprehensive Aligned Sequence Construction for Automated Design of Effective Probes (CASCADE-P)
• Igor Dubosarskiy – Java implementations
• Tim Harsch – RDBMS consultations
• Lisa Corsetti – Apache module management
• Kevin Melissare – Graphics
Hierarchical Phylocodes

Phylocode 2.30.9.2.10
5th Level: C.ACETOBUTYLICUM_SUBGROUP
4th Level: C.BOTULINUM_GROUP
3rd Level: CLOSTRIDIUM_AND_RELATIVES
2nd Level: GRAM_POSITIVE_BACTERIA
1st Level: BACTERIA
Members:
• Clostridium collagenovorans DSM 3089 (T)
• Clostridium sardiniensis ATCC 33455 (T)
• Clostridium acetobutylicum ATCC 824 (T)
• Clostridium acetobutylicum DSM 792 (T)
• Clostridium acetobutylicum ATCC 824 (T)
• Clostridium acetobutylicum NCDO 1712
• Clostridium acetobutylicum DSM 1731

Phylocode 2.28.3.27.2
5th Level: ESCHERICHIA_SUBGROUP
4th Level: ENTERICS_AND_RELATIVES (Group)
3rd Level: GAMMA_SUBDIVISION
2nd Level: PROTEOBACTERIA
1st Level: BACTERIA
Members:
• U85138 clone ACK-SA7
• AE000452 Escherichia coli str. K-12
• Er.trachep Erwinia tracheiphila LMG 2906 (T)
• E.coliK12 Escherichia coli [gene=rrnG gene]
• Haf.alvei3 Hafnia alvei
• S.tymuriu3 Salmonella typhimurium str. Stm1
• Shi.boydii Shigella boydii
• AF084835 str. KN4
• S.enterit4 Salmonella enteritidis str. SE22
• S.ptyphi6 Salmonella paratyphi
• S.typhi3 Salmonella typhi str. St111
• S.bovismrb Salmonella bovis morbificans Sbm1
• Alt.agrlyt Alterococcus agarolyticus str. ADT3
• Shi.flxne2 Shigella flexneri ATCC 29903 (T)
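A minimal sketch of how dotted phylocodes like the two shown could be compared, assuming each dot-separated field corresponds to one level of the hierarchy (1st level first). The helper names `parse_phylocode` and `is_within` are hypothetical and not part of the greengenes tools.

```python
def parse_phylocode(code: str) -> tuple[int, ...]:
    """Split a dotted phylocode into one integer per hierarchy level."""
    return tuple(int(part) for part in code.split("."))


def is_within(code: str, ancestor: str) -> bool:
    """True if code sits at or below ancestor, comparing whole levels, not characters."""
    c, a = parse_phylocode(code), parse_phylocode(ancestor)
    return c[:len(a)] == a


acetobutylicum = "2.30.9.2.10"   # C.ACETOBUTYLICUM_SUBGROUP
escherichia = "2.28.3.27.2"      # ESCHERICHIA_SUBGROUP

print(is_within(acetobutylicum, "2.30"))  # shares the 2nd-level code 30
print(is_within(escherichia, "2.30"))     # 2nd-level code is 28
print(is_within(escherichia, "2"))        # both sit under 1st-level code 2
```

Comparing integer tuples rather than raw strings avoids treating "2.3" as a prefix of "2.30".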
Chip Taxa
• Avoid groupings based on historical nomenclature.
• Sequence-dependent classification by transitive similarity clustering.
• Each sequence must end up in exactly 1 taxon.
• Transitivity: if x R y and y R z, then x R z.
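The transitivity rule can be sketched as grouping sequences into connected components of a "similar to" relation, which guarantees that each sequence lands in exactly one group. The sketch assumes the relation R is symmetric and is supplied as a list of pairs; the function name and the toy labels are hypothetical.

```python
from collections import defaultdict, deque


def transitive_clusters(items, related_pairs):
    """Group items so that x R y and y R z place x, y and z in the same cluster."""
    neighbors = defaultdict(set)
    for x, y in related_pairs:
        neighbors[x].add(y)
        neighbors[y].add(x)   # assumption: R is symmetric

    seen, clusters = set(), []
    for start in items:
        if start in seen:
            continue
        cluster, queue = set(), deque([start])
        seen.add(start)
        while queue:          # breadth-first walk over everything reachable from start
            node = queue.popleft()
            cluster.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(cluster)
    return clusters


# Toy example: A~B and B~C imply A, B, C share a cluster; F is related to nothing.
clusters = transitive_clusters("ABCDEF", [("A", "B"), ("B", "C"), ("D", "E")])
print(clusters)
```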
Assigning Probes for GeneChip Microarray
• Select probe sets for each taxon
• Ideal Probe
– Present in all sequences of the taxon
– Not present outside the taxon
– Unable to cross-hybridize (X-hybe) with sequences in other taxa
• Ideal Mis-match Control Probe
– Unable to X-hybe to any sequence
Finding groupings
[Figure: grid of sequences A to O (rows) against probes 1 to 24 (columns)]
Consider A – O to be 16S sequences.
Consider 1 – 24 to be probes already embedded on the chip.
First, associate all available probes with all available sequences.
Let probe similarities drive sequence groupings.
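The association step can be pictured as a table recording which probes hit which sequences. In the sketch below an exact substring match stands in for hybridization, the sequences and 4-base probes are toy data, and Jaccard overlap is used only as one simple way to compare hit profiles; the slides do not name a similarity measure.

```python
def probe_profile(sequences: dict[str, str], probes: list[str]) -> dict[str, frozenset[int]]:
    """Map each sequence name to the set of probe indices found in that sequence."""
    return {
        name: frozenset(i for i, p in enumerate(probes) if p in seq)
        for name, seq in sequences.items()
    }


def jaccard(a: frozenset, b: frozenset) -> float:
    """Shared probes divided by all probes hit by either sequence."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data, not real 16S sequences or chip probes.
sequences = {"A": "ACGTACGG", "B": "ACGTACGC", "C": "TTGACCAA"}
probes = ["ACGT", "TACG", "ACGG", "TTGA", "CCAA"]

profiles = probe_profile(sequences, probes)
print(profiles)
print(jaccard(profiles["A"], profiles["B"]), jaccard(profiles["A"], profiles["C"]))
```

Sequences A and B share most of their probes and would be candidates for the same group, while C shares none with either.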
Progressive Transitive Clustering
[Figure: Count of Solved Clusters with each Cycle's Parameters. x-axis: Cycle (1 to 77); y-axis: Count (log scale, 1 to 10000); series: Total Clusters, Solved Clusters; cycle parameters: uGBpp lock, uPWpp sep]
DEFINE:
upp (useful probe pair): a PM,MM pair where the 20-mer PM complements all intra-cluster sequences AND the central 16-mer of the PM does not complement any extra-cluster sequences AND the central 16-mer of the MM does not complement any sequence. Probe pairs are reassessed whenever the sequence clusters are altered.
nGBupp: number of upps for a cluster; these probe pairs globally differentiate a cluster from all other sequences.
L: the value of nGBupp which must be met for a cluster to be locked.
nPWuppA: number of useful probe pairs which pair-wise differentiate clustA from clustB.
nPWuppB: number of useful probe pairs which pair-wise differentiate clustB from clustA.
m: the value of nPWupp which must be met to inhibit two clusters from merging.
FOR L (11 .. 4) DO
  FOR m (1 .. 10) DO
    Determine nGBupp for each cluster;
    Lock all clusters where nGBupp ≥ L;
    Pair-wise compare non-locked clusters (clustA, clustB);
    UNLESS (nPWuppA ≥ m AND nPWuppB ≥ m)
      Merge sequences of clustA and clustB into one cluster;
    END UNLESS
  END FOR
  Uncluster non-locked clusters;
END FOR
650 clusters found
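The lock-and-merge loop can be sketched in Python under several simplifying assumptions that are not in the slides: exact substring matching stands in for complementarity, the MM control condition is omitted, merging within one cycle is done transitively in a single pass, and the toy run uses 4-base probes with a 2-base core instead of 20-mers with a 16-mer core. All function and parameter names are hypothetical.

```python
from itertools import combinations


def central(probe, core_len):
    """Central core_len bases of a probe."""
    start = (len(probe) - core_len) // 2
    return probe[start:start + core_len]


def count_upp(probes, inside, outside, core_len):
    """Probes found in every inside sequence whose core occurs in no outside sequence."""
    return sum(
        all(p in s for s in inside)
        and not any(central(p, core_len) in s for s in outside)
        for p in probes
    )


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def progressive_clustering(seqs, probes, lock_levels, merge_levels, core_len):
    locked = []                                      # solved clusters
    open_ = [frozenset([name]) for name in seqs]     # start from singletons
    for L in lock_levels:
        for m in merge_levels:
            # Lock every cluster with at least L globally distinguishing probes.
            remaining = []
            for c in open_:
                inside = [seqs[n] for n in c]
                outside = [s for n, s in seqs.items() if n not in c]
                if count_upp(probes, inside, outside, core_len) >= L:
                    locked.append(c)
                else:
                    remaining.append(c)
            open_ = remaining
            # Merge non-locked pairs unless both directions reach m pair-wise probes.
            parent = list(range(len(open_)))
            for i, j in combinations(range(len(open_)), 2):
                a = [seqs[n] for n in open_[i]]
                b = [seqs[n] for n in open_[j]]
                if not (count_upp(probes, a, b, core_len) >= m
                        and count_upp(probes, b, a, core_len) >= m):
                    parent[find(parent, i)] = find(parent, j)
            groups = {}
            for i, c in enumerate(open_):
                groups.setdefault(find(parent, i), set()).update(c)
            open_ = [frozenset(g) for g in groups.values()]
        # Uncluster whatever is still not locked before the next lock level.
        open_ = [frozenset([n]) for c in open_ for n in c]
    return locked, open_


# Toy data: a1 and a2 are identical, b shares nothing with them.
seqs = {"a1": "ACACACAC", "a2": "ACACACAC", "b": "GTGTGTGT"}
probes = sorted({s[i:i + 4] for s in seqs.values() for i in range(len(s) - 3)})
locked, still_open = progressive_clustering(seqs, probes, [2, 1], [1, 2], core_len=2)
print(locked, still_open)
```

With real 20-mer probes the call would presumably use `lock_levels=range(11, 3, -1)`, `merge_levels=range(1, 11)` and `core_len=16`, mirroring the loop bounds in the pseudocode.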
MATCH:    cctagcatgCattctgcata
MISMATCH: cctagcatgGattctgcata
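In the example pair the mismatch probe differs from the match probe only at the 10th of its 20 bases, where C is replaced by its complement G. A small sketch of that substitution follows; the function name and the default position are assumptions drawn from this single example, not a stated design rule.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def make_mismatch(pm: str, position: int = 9) -> str:
    """Return the probe in upper case with the base at `position` (0-based) complemented."""
    bases = list(pm.upper())
    bases[position] = COMPLEMENT[bases[position]]
    return "".join(bases)


pm = "cctagcatgCattctgcata"
mm = make_mismatch(pm)
print(pm.upper())
print(mm)
```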
Approach: Custom Affymetrix GeneChip
• Massive parallelism – Up to 500,000 probes in a 1.28 cm² array
• Identification of multiple species in a mixed population
• Single nucleotide mismatch resolution
General Protocol
• Sample sources: Air, Soil, Feces, Blood, Water
• Nucleic acid: rRNA, gDNA
• PCR Amplify DNA (Universal 16S rDNA PCR)
• Fractionate DNA
• Biotin End-label
• Hybridize
[Figure: chip contains probes adhered to glass surface in grid pattern; features labeled 50 µ by 50 µ]
Locating Hybridization Events

Parameter            Frankia   Clostridium
Positive fraction    1.00      0.64
Average difference   3720      625

[Figure: PM and MM probe intensities for Frankia sp. str. G48 and Clostridium butyricum]
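One plausible reading of the two parameters is sketched below: positive fraction as the share of probe pairs where PM exceeds MM, and average difference as the mean of PM minus MM across the probe set. These definitions, the `min_diff` threshold, and the intensity values are assumptions for illustration; the slides do not give the scoring rule actually used.

```python
def score_probe_set(pm, mm, min_diff=0.0):
    """Return (positive fraction, average difference) for paired PM/MM intensities."""
    diffs = [p - q for p, q in zip(pm, mm)]
    positive_fraction = sum(d > min_diff for d in diffs) / len(diffs)
    average_difference = sum(diffs) / len(diffs)
    return positive_fraction, average_difference


# Hypothetical intensities for a four-pair probe set.
pm_signal = [1200, 900, 400, 1500]
mm_signal = [300, 950, 100, 500]
pos_frac, avg_diff = score_probe_set(pm_signal, mm_signal)
print(pos_frac, avg_diff)
```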
Can the chip detect more than one analyte?
Combinatorial scoring of “Probe Sets” is able to categorize mixed samples.

OTU                 % pos pairs
2.30.7.12.1.013*    100
2.30.7.12.1.014     46 – 57
2.30.7.12.1.015     54 – 61
2.30.7.12.1.016     39 – 54
2.30.7.12.1.017     18
2.30.7.12.2.002     11
2.30.7.12.2.003     14
2.30.7.12.2.005     14 – 32
2.30.7.12.2.006     18 – 32
2.30.7.12.2.007     21 – 25
2.30.7.12.2.008     14 – 29
2.30.7.12.3.001     7 – 25
2.30.7.12.3.002     8
2.30.7.12.3.003     4
2.30.7.12.3.004     7 – 11
2.30.7.12.3.005     4 – 14
2.30.7.12.3.006     11
2.30.7.12.3.007     14 – 29
2.30.7.12.3.008     7
2.30.7.12.3.009     4 – 11
2.30.7.12.3.010     0 – 4
2.30.7.12.4.001     21 – 36
2.30.7.12.4.004*    100
2.30.7.12.4.005     0 – 11
2.30.7.12.4.006     29 – 54
2.30.7.12.4.007     11 – 14
2.30.7.12.4.008     11

Percent of probe-pairs scored positive for each probe set in the Staphylococcus Group. Hybridization results from spike-in experiment done in triplicate.
Figure labels: S. aureus spike, B. anthracis spike
Sonya Murray
Aubree Hubbel
Application ExampleApplication Example
• Does air filter sample processing affect detection?
– Method 1
• Wash particles from filter with SDS
• Digest particles with lysozyme
• Purify DNA using Qiagen kit
– Method 2
• Pulverize filter and particles with bead mill, SDS, P:C:ISA
• Purify DNA using MoBio kit and Sephacryl column
Bead beating allowed greater diversity to be detected.
Quantitative Analysis
• Could the concentration of each amplicon in a sample be measured by fluorescence intensity?
• Experimental setup for 20 point Latin Square calibration:

SPIKE CONCENTRATION (pM in Hybridization Solution)
Experiment   Oc.oenos   Fer.nod   Sap.grand   M.neuro   H2O   Environmental amplicons*
1            5          13        31          74        No    Yes
2            13         31        74          143       No    Yes
3            31         74        143         5         No    Yes
4            74         143       5           13        No    Yes
5            143        5         13          31        No    Yes
6            0          0         0           0         Yes   Yes
* 18 µL of products from 30 cycle universal 16S PCR of gDNA extracted from U.K. air sample.
Sonya Murray
Carol Stone
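The spiked rows of the design follow a cyclic pattern: each experiment shifts the five concentration levels by one position, so every spike is measured once at every level across the five experiments, giving 4 × 5 = 20 calibration points. A short sketch that regenerates those rows is below; the function name is hypothetical.

```python
CONCENTRATIONS_PM = [5, 13, 31, 74, 143]
SPIKES = ["Oc.oenos", "Fer.nod", "Sap.grand", "M.neuro"]


def latin_square(levels, spikes):
    """One row per experiment: spike j in experiment i gets levels[(i + j) % len(levels)]."""
    return [
        {spike: levels[(shift + j) % len(levels)] for j, spike in enumerate(spikes)}
        for shift in range(len(levels))
    ]


design = latin_square(CONCENTRATIONS_PM, SPIKES)
for number, row in enumerate(design, start=1):
    print(number, row)
```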
Experiment   Oo             Fn             Sg             Mn
1            5 (5474)       13 (16069)     31 (31805)     74 (124732)
2            13 (7885)      31 (61185)     74 (81107)     143 (115237)
3            31 (58912)     74 (70317)     143 (98235)    5 (8759)
4            74 (101803)    143 (69529)    5 (7789)       13 (11530)
5            143 (149869)   5 (4534)       13 (16228)     31 (56103)
6            n.a.           n.a.           n.a.           n.a.
Final concentration of spike in hybridization in pM. Values in parentheses are the resulting hybridization signal in arbitrary units (a.u.) obtained from the Latin Square experiments. All spikes were added to 18 µL of products of 30 cycle universal SSU PCR of gDNA extracted from air samples using Method 2.
• Log2 transformed
• Linear least squares regression
• Pearson’s correlation coefficient was significant (df = 18)
• 95% confidence intervals calculated according to the National Measurement System’s Valid Analytical Measurement Programme (VAM)
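The analysis steps listed here can be sketched with the 20 spike measurements from the Latin Square results: log2-transform both axes, fit a straight line by least squares, and compute Pearson's r and the residual standard error. This is an illustrative recomputation, not the original analysis script; the fitted values depend on exactly which measurements went into the original fit, so they need not match the coefficients reported on the calibration plot.

```python
import math

# (concentration in pM, hybridization signal in a.u.) for the 20 spiked measurements.
POINTS = [
    (5, 5474), (13, 16069), (31, 31805), (74, 124732),
    (13, 7885), (31, 61185), (74, 81107), (143, 115237),
    (31, 58912), (74, 70317), (143, 98235), (5, 8759),
    (74, 101803), (143, 69529), (5, 7789), (13, 11530),
    (143, 149869), (5, 4534), (13, 16228), (31, 56103),
]


def fit_log2(points):
    """Least-squares line through (log2 conc, log2 signal); returns slope, intercept, r, RSE."""
    xs = [math.log2(c) for c, _ in points]
    ys = [math.log2(s) for _, s in points]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    rse = math.sqrt((syy - slope * sxy) / (n - 2))   # n - 2 = 18 degrees of freedom
    return slope, intercept, r, rse


slope, intercept, r, rse = fit_log2(POINTS)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f} RSE={rse:.3f}")
```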
Figure 2 - Calibration Plot
Spike-in regression: y = 0.9207x + 10.504, R = 0.974
[Plot: x-axis log2 Concentration (pM), 0 to 8; y-axis log2 HybScore, 9 to 19; series: Spike-in rDNA, Environmental rDNA, 95% Confidence Limits, Spike-in Regression]
Environmental community is measured with confidence intervals.
Figure 3 - Concentration of Environmental SSU Amplicons
[Bar chart: x-axis rDNA Concentration (pM), 0 to 140; y-axis Taxa]
Taxa shown:
• Clostridium thermobutyricum
• Streptococcus anginosus
• Bacillus racemilacticus
• Pseudomonas sp.
• symbiont of Solemya velum
• Clostridium limosum+
• Eurotiales (Aspergillus+)
• Bartonella+
• Staphylococcus delphini+
• Vibrio parahaemolyticus+
• Pasteurella sp.
• Heterotextus alpinus
• Streptomyces
• Staphylococcus cohnii+
• Propionibacterium lymphophilum
• Leucostoma persoonii
Conf Interval:
$$\mathrm{Conc} \pm \frac{t \cdot \mathrm{RSE}}{b} \sqrt{\frac{1}{m} + \frac{1}{n} + \frac{(Y - y)^2}{b^2 (n-1) s_x^2}}$$
b = slope from regression
Y = mean of 6 replicate measurements
m = number of repeat measurements = 6
n = number of points used for calibration = 20
y = mean of the HybScores for the 20 points used for calibration
t = critical value obtained from t-table for 18 d.f. for 95% = 1.734
RSE = residual standard error of calibration points = 0.56
s_x = standard deviation of the conc. for the 20 points used for calibration
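A sketch of the interval calculation, reading the expression as the usual straight-line calibration interval, Conc ± (t·RSE/b)·sqrt(1/m + 1/n + (Y − y)² / (b²(n − 1)s_x²)), evaluated on the log2 concentration scale. That reading, including the square root and the ± sign, is an assumption. The slope and intercept are taken from the calibration plot and t, RSE, m and n from the definitions; `y_bar` and `s_x` in the example call are placeholder values, not figures from the slides.

```python
import math


def conc_interval(Y, slope, intercept, y_bar, s_x, t=1.734, rse=0.56, m=6, n=20):
    """Return (log2 concentration estimate, half-width of its confidence interval)."""
    estimate = (Y - intercept) / slope               # invert y = slope * x + intercept
    half_width = (t * rse / slope) * math.sqrt(
        1 / m + 1 / n + (Y - y_bar) ** 2 / (slope ** 2 * (n - 1) * s_x ** 2)
    )
    return estimate, half_width


# Placeholder calibration summary: y_bar and s_x are hypothetical.
est, half = conc_interval(Y=14.0, slope=0.9207, intercept=10.504, y_bar=15.0, s_x=1.77)
low_pm, high_pm = 2 ** (est - half), 2 ** (est + half)   # back-transform to pM
print(f"log2 conc = {est:.2f} +/- {half:.2f}  ->  {low_pm:.1f} to {high_pm:.1f} pM")
```

The interval is narrowest when the sample's mean HybScore equals the calibration mean and widens as Y moves away from it.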
Summary
The SSU microarray was able to rapidly quantify and taxonomically classify a complex consortium of rDNA amplicons from both prokaryotic and eukaryotic origins.
Acknowledgements
• Gary Andersen – group leader
• Carol Stone – sample collection, hybridization
• Sonya Murray – hybridizations