Identification of Bacteria Using Phylogenetic Relationships Revealed by MS/MS Sequencing
of Tryptic Peptides Derived from Cellular Proteins
Jacek P. Dworzanski, Geo-Centers, Inc.,
Aberdeen Proving Ground, MD 21010-0068
Samir Deshpande1; Rui Chen2; A. Peter Snyder3; Liang Li2 and Charles H. Wick3
1Science and Technology Corporation, Edgewood, MD 21040; 2Department of Chemistry, University of Alberta, Edmonton, Alberta T6G 2G2; 3U.S. Army Edgewood Chemical Biological Center,
Aberdeen Proving Ground, MD 21010-5424
2004 Joint Service Scientific Conference on Chemical & Biological Defense Research
15-17 November 2004; Hunt Valley, Maryland
Report Documentation Page (Form Approved, OMB No. 0704-0188)
1. REPORT DATE: 17 NOV 2004
2. REPORT TYPE: N/A
4. TITLE AND SUBTITLE: Identification of Bacteria Using Phylogenetic Relationships Revealed by MS/MS Sequencing of Tryptic Peptides Derived from Cellular Proteins
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Science and Technology Corporation, Edgewood, MD 21040
12. DISTRIBUTION/AVAILABILITY STATEMENT: Approved for public release, distribution unlimited
13. SUPPLEMENTARY NOTES: See also ADM001849, 2004 Scientific Conference on Chemical and Biological Defense Research. Held in Hunt Valley, Maryland on 15-17 November 2004. The original document contains color images.
16. SECURITY CLASSIFICATION OF: a. REPORT unclassified; b. ABSTRACT unclassified; c. THIS PAGE unclassified
17. LIMITATION OF ABSTRACT: UU
18. NUMBER OF PAGES: 27
Standard Form 298 (Rev. 8-98), Prescribed by ANSI Std Z39-18
National Institute of Allergy and Infectious Diseases, National Institutes of Health
Category A, B & C Priority Bacterial Pathogens

Category A
1. Clostridium botulinum
2. Bacillus anthracis (anthrax)
3. Francisella tularensis (tularemia)
4. Yersinia pestis

Category B
1. Brucella species (brucellosis)
2. Burkholderia pseudomallei
3. Burkholderia mallei (glanders)
4. Campylobacter jejuni
5. Clostridium perfringens (epsilon toxin)
6. Coxiella burnetii (Q fever)
7. Escherichia coli (diarrheagenic)
8. Listeria monocytogenes
9. Rickettsia prowazekii (typhus fever)
10. Salmonella
11. Shigella species
12. Staphylococcus aureus (enterotoxin B)
13. Vibrios (pathogenic)
14. Yersinia enterocolitica

Category C
1. Mycobacterium tuberculosis (multiple drug resistant)
2. Rickettsias (other)

Genomes of all above organisms have been sequenced
Prokaryotic Ongoing Genome Projects: 489 {528}
Archaeal: 28 {27}
Bacterial: 461 {499}
Number of Fully Sequenced Genomes of Eubacteria: 145 {178} (From 1995)
Number of Fully Sequenced Genomes During Last 12 Months: 55 {68}
Counts as of 5/12/2004 versus {11/12/2004}
Example of Ongoing Genome Projects
Bacillus anthracis A1055 (Group C): TIGR, NIAID
Bacillus anthracis Ames Ancestor: TIGR, NIAID / NSF
Bacillus anthracis Ames Florida: TIGR, NSF
Bacillus anthracis Australia 94 (GT55, Group A3a): TIGR, NIAID
Bacillus anthracis CNEVA-9066 (GT79, Group B2): TIGR, NIAID
Bacillus anthracis Kruger B (GT87, Group B1): TIGR, NIAID
Bacillus anthracis Vollum (GT77, Group A4): TIGR, NIAID
Bacillus anthracis Western N. America (GT3, Group A1a): TIGR, NIAID
Bacillus anthracis STN: DOE/JGI
Bacillus anthracis ZK: DOE/JGI
First, some terminology…
Taxonomy - the science of naming and classifying organisms;
Classification - placement of an organism within a scheme relating different groups of organisms;
Identification - the determination of whether an organism should be placed within a group of organisms known to fit within some classification scheme (the practical use of classification criteria);
Phylogenetics - focuses on evolutionary relationships between organisms or genes/proteins
Phylogenetic Approach
The ideal means of identifying and classifying bacteria would be to compare each gene sequence in a given strain with the gene sequences for every known species.
Taxonomy of Bacteria (Linnaean System)
Examples: Bacillus subtilis & Escherichia coli

Rank      Bacillus subtilis    Escherichia coli
Kingdom   Bacteria             Bacteria
Phylum    Firmicutes           Proteobacteria
Class     Bacilli              γ-Proteobacteria
Order     Bacillales           Enterobacteriales
Family    Bacillaceae          Enterobacteriaceae
Genus     Bacillus             Escherichia
Species   Bacillus subtilis    Escherichia coli
Universal Phylogenetic Tree of Bacteria Based on SSU rRNA Sequences
[Figure: phylogenetic tree. Scale bar: 0.1 base substitution per nucleotide. Labeled phyla: Aquificae, Thermotogae, Planctomycetes, Actinobacteria, Firmicutes, Cyanobacteria, Fusobacteria, Proteobacteria, Spirochaetes, Chlamydiae, Deinococcus-Thermus, Bacteroidetes/Chlorobi. Also labeled: Archaea, Eucarya.]
Representatives of Marked Phylogenetic Divisions (Phyla) of Bacteria Are Included in the Database
Approach: (1) Bacterial Sample Processing
Bacteria → Lysis → Proteins → Digestion (trypsin) → Peptides
Flow of Genetic Information
DNA:     CCTGAGCCAACTATTGATGAA
         (transcription)
mRNA:    CCUGAGCCAACUAUUGAUGAA
         (translation)
Protein: PEPTIDE
The relation between the sequence of bases in DNA and the sequence of amino acids in a protein
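The DNA, mRNA and protein strings on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes the standard genetic code and treats the displayed DNA as the coding strand; the function names `transcribe` and `translate` are illustrative and do not come from the presentation.

```python
# Standard genetic code, with codons enumerated in base order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at a stop codon or the end."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```

The seven codons CCU, GAG, CCA, ACU, AUU, GAU and GAA encode P, E, P, T, I, D and E, which is why the slide spells the protein as PEPTIDE.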
Mass Spectrometry: Electrospray Ionization
[Figure: electrospray ionization source driven by a power supply (+/−), producing positively and negatively charged species. Peptide ions in (PEPTIDE); fragment ions out (m1 to m7).]
Approach: (2) Tandem Mass Spectrometry of Peptide Ions
[Figure: MS/MS Spectra of Peptide Ions. Peptide ions P #1, P #2 and P #3 from the Unknown Bacterium each give a spectrum of Intensity versus Mass/Charge. The peaks m1 to m7 are labeled with the fragment ladder P, PE, PEP, PEPT, PEPTI, PEPTID and the full peptide (PEPTIDE, PEPTIDK and PEPTIDR are shown).]
Approach: (3) Sequencing of Peptide Ions
[Figure: Database Matching. Peptides P #1, P #2 and P #3 from the Unknown Bacterium are compared with database entries for B. anthracis, V. cholerae and Y. pestis. Identified agent: Y. pestis.]
Approach: (4) Matching of Identified Tryptic Peptides to Theoretical Peptides of Database Bacterial Proteomes
[Figure: Virtual Array of Peptide Sequences. The Translated Protein Sequences (ORF 1, ORF 2, ORF 3, … ORF m; m = 500 to 7,500) of each of the Bacterial Proteomes (1, 2, 3, … n = 170) are represented by their Tryptic Peptides, which end in lysine or arginine (X1X2X3…K, X1X2X3…R).]
Total Number of Protein Coding Sequences: 500,000
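A virtual array of tryptic peptides can be sketched as an in silico digest followed by a lookup table from peptide to proteome. The sketch below is illustrative only. It assumes the common rule that trypsin cleaves after K or R except when the next residue is P, allows no missed cleavages, and drops peptides shorter than six residues; none of these settings are stated in the slides. The toy protein strings and the names `tryptic_digest`, `build_peptide_index`, `organism_1` and `organism_2` are hypothetical.

```python
def tryptic_digest(protein: str, min_length: int = 6) -> list:
    """Split a protein after K or R (not before P); keep peptides >= min_length."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        at_end = i == len(protein) - 1
        cleave = residue in "KR" and not at_end and protein[i + 1] != "P"
        if cleave or at_end:
            peptides.append(protein[start:i + 1])
            start = i + 1
    return [p for p in peptides if len(p) >= min_length]


def build_peptide_index(proteomes: dict) -> dict:
    """Map each theoretical tryptic peptide to the set of proteomes containing it."""
    index = {}
    for organism, proteins in proteomes.items():
        for protein in proteins:
            for peptide in tryptic_digest(protein):
                index.setdefault(peptide, set()).add(organism)
    return index


# Hypothetical one-protein "proteomes" that differ by a single residue (Y vs F).
toy_proteomes = {
    "organism_1": ["MAKLVSLAEQQLGGYQKIEDALNSTRPEPTIDEK"],
    "organism_2": ["MAKLVSLAEQQLGGFQKIEDALNSTRPEPTIDEK"],
}
index = build_peptide_index(toy_proteomes)
for peptide in sorted(index):
    print(peptide, sorted(index[peptide]))
```

In this toy case the peptide shared by both proteins maps to both organisms, while the Y and F variants each map to one organism, which is the property that makes a peptide useful for discrimination.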
Matching
Classification and Identification of Unknown Bacterium
[Figure: Matrix of Assignments. Rows: tryptic peptide sequences s1, s2, s3, … sm (X1X2X3…K, X1X2X3…R). Column labels: u, b1, b2, b3, … bn (Bacterial Proteomes).]
[Figure panels derived from the matrix of assignments:]
Cluster Analysis (dendrogram of u, b1, b2, b3, b4, b5)
Principal Component Analysis (axes F1, F2, F3)
Affinity Histogram (SIM-Score, up to 100, for b1, b2, b3, … bn)
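A matrix of assignments and a per-proteome histogram value can be sketched as follows. The slides do not define how the SIM-Score is computed, so the percentage used here (the share of sample peptides found in each proteome) is only a stand-in. The peptide strings, the proteome contents and the function names are hypothetical.

```python
def assignment_matrix(peptides: list, proteomes: dict):
    """Binary matrix: rows are sample peptides, columns are database proteomes."""
    names = sorted(proteomes)
    matrix = [
        [1 if peptide in proteomes[name] else 0 for name in names]
        for peptide in peptides
    ]
    return names, matrix


def match_percentages(names: list, matrix: list) -> dict:
    """Percentage of sample peptides assigned to each proteome (assumed score)."""
    total = len(matrix)
    return {
        name: 100.0 * sum(row[col] for row in matrix) / total
        for col, name in enumerate(names)
    }


# Hypothetical accepted peptides from the unknown sample.
sample_peptides = ["LVSK", "GDTR", "EYEK", "TNTR"]
# Hypothetical theoretical tryptic peptides of three database proteomes.
database = {
    "b1": {"LVSK", "GDTR", "EYEK", "TNTR"},
    "b2": {"LVSK", "GDTR"},
    "b3": {"EYEK"},
}

names, matrix = assignment_matrix(sample_peptides, database)
print(match_percentages(names, matrix))  # {'b1': 100.0, 'b2': 50.0, 'b3': 25.0}
```

The same binary matrix is the natural input for the cluster analysis and principal component analysis panels, since each column is a presence/absence profile of one proteome.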
Screen Capture Image Displaying Raw LC-MS/MS Data (L. lactis)
[Figure labels: SEQUEST Output Data, Discriminant Analysis Parameters, Filters, Assignment Matrix; histogram of SIM-Score versus Bacterium Number.]
Screen Capture Image Displaying Accepted Peptide Assignments (P = 90%) (L. lactis)
Screen Capture Image Displaying Accepted Peptide Assignments After Removal of Degenerate Peptide Sequences (L. lactis)
Identification of Bacteria Using Phylogenetic Relationships Revealed by MS/MS Sequencing of Tryptic Peptides (Lactococcus lactis)
[Figure series: paired histograms of SIM-Score and ID-Score (both 0 to 100) at each taxonomic level. Each line below gives the groups labeled on the axes, followed by the group named on the slide.]
Phylum level (phyla 1 to 12; labeled: Proteobacteria, Firmicutes): Firmicutes
Class level (Bacilli, Clostridia, Mollicutes): Bacilli
Order level (Bacillales, Lactobacillales): Bacillales
Family level (Enterococcaceae, Lactobacillaceae, Streptococcaceae): Streptococcaceae
Genus level (Lactococcus, Streptococcus): Lactococcus
Species level (L. lactis, S. mutans, S. pneumoniae R6, S. pneumoniae RT4, S. pyogenes, S. agalactiae): L. lactis
Analysis of Bacterial Mixture [E. coli (K-12) and B. subtilis (2:1, w:w)]
[Figure series: ID-Score histograms at successive taxonomic levels for the mixture.]
Phylum Level (phyla 1 to 12; ID-Score axis 0 to 60; labeled: Firmicutes, Proteobacteria, Spirochaetes)
Firmicutes: Class Bacilli; Bacillales; Bacillaceae; Bacillus; Species B. subtilis
Proteobacteria: Class γ-Proteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Strain E. coli K-12
Identification of Bacteria Using Phylogenetic Relationships Revealed by MS/MS Sequencing of Tryptic Peptides (Bacillus cereus)
[Figure series: paired histograms of Genomic Similarities (SIM) Score and ID Score (both 0 to 100) at each taxonomic level. Each line below gives the groups labeled on the axes, followed by the group named on the slide.]
Phylum level (phyla 1 to 12; labeled: Firmicutes, Proteobacteria, Bacteroidetes/Chlorobi Group): Firmicutes
Class level (Firmicutes classes: Bacilli, Clostridia, Mollicutes): Bacilli
Order level (Bacillales, Lactobacillales): Bacillales
Family level (Bacillaceae, Listeriaceae, Staphylococcaceae): Bacillaceae
Species level (B. halodurans, B. cereus group, B. subtilis, O. iheyensis): B. cereus group
Subspecies level (B. cereus group; entries 1 to 10): B. cereus ATCC 14579
Principal Component Analysis of Peptide Assignments
[Figure: three-dimensional scatter plot on axes PC 1, PC 2 and PC 3. Labeled groups: B. cereus Group; B. subtilis/other Bacilli. Labeled points: Unknown, B. cereus ATCC 14579.]
Representation of the database Bacillaceae species and the unknown organism in the principal component space (PC 1, PC 2, PC 3), reflecting 80% of the total information included in the assignment matrix of 125 amino acid peptide sequences to bacterial proteomes.
Cluster Analysis of Peptide Assignments
[Figure: dendrogram of Bacillaceae; horizontal axis: Linkage Distance (0 to 120); an "Error" marker is shown. Leaves in displayed order:]
B. subtilis
B. licheniformis
O. iheyensis
B. halodurans
B. cereus group:
  B. cereus ATCC 10987
  B. thuringiensis
  B. cereus ZK
  B. anthracis A2012
  B. anthracis Sterne
  B. anthracis Ames ancestor
  B. anthracis Ames
  B. cereus ATCC 14579
  Unknown
Hierarchical clustering of Bacillaceae species in 125-dimensional space of peptide sequences. (Complete linkage; squared Euclidean distances)
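Complete-linkage clustering with squared Euclidean distances, as named in the caption, can be sketched in plain Python. The presence/absence profiles below are hypothetical six-peptide vectors rather than the 125-dimensional data from the study, and `complete_linkage` is an illustrative naive implementation, not the software used by the authors.

```python
from itertools import combinations


def squared_euclidean(u: list, v: list) -> int:
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def complete_linkage(profiles: dict) -> list:
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Complete linkage: distance between clusters is the largest
            # pairwise distance between their members.
            d = max(squared_euclidean(profiles[x], profiles[y]) for x in a for y in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical peptide presence (1) / absence (0) profiles.
profiles = {
    "unknown": [1, 1, 1, 1, 0, 0],
    "b1": [1, 1, 1, 1, 0, 0],
    "b2": [1, 1, 1, 0, 0, 0],
    "b3": [0, 0, 0, 0, 1, 1],
}
for a, b, d in complete_linkage(profiles):
    print(a, b, d)
```

With these toy profiles the unknown first joins `b1` at distance 0, then `b2` at distance 1, and `b3` joins last at distance 6, mirroring how the unknown sits next to its closest database strain in the dendrogram.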
Differences in Proteome Composition Between an Unknown Sample and Database Bacteria
[Figure: presence/absence map. Vertical axis: Peptide Number (0 to 120). Columns: B. anthracis A2012, B. anthracis Ames, B. anthracis Ames ancestor, B. anthracis Sterne, B. cereus ATCC 10987, B. cereus ZK, B. thuringiensis, B. cereus ATCC 14579. Blue line: absent; yellow line: present. Peptide numbers 45, 47 and 107 are marked.]
Annotated peptide sequences:
ESYAAVQADTASK
VGSPQPGDLVFFQGTYK
LVSLAEQQLGGYQK
EYEVPITAAQADQIVLLMK
STDTLGAQILGNTMEGLYR
TNTPMLLQVLEDEVFK
TGDAALGSISNILLR
GDTLTAVDNDLSAWFWDEK
DVNEHTLEEEELPVNIEAYK
IEDALNSTR
Unknown sample: correctly identified as B. cereus ATCC 14579
Product Ion Mass Spectrum of a Peptide Ion
Amino Acid Sequence Information Obtained in Less than 1 second
[Figure: product ion spectrum of peptide 107, Relative Abundance (0 to 100) versus m/z (0 to 1600), with fragment ions b2 to b13 and y2 to y13 annotated.]
Peptide 107: L V S L A E Q Q L G G Y Q K
b ions (m/z): b2 213.3, b3 300.4, b4 413.5, b5 484.6, b6 613.7, b7 741.9, b8 870.0, b9 983.1, b10 1040.2, b11 1097.3, b12 1260.4, b13 1388.6
y ions (m/z): y13 1421.6, y12 1322.4, y11 1235.4, y10 1122.2, y9 1051.1, y8 922.0, y7 793.9, y6 665.7, y5 552.6, y4 495.5, y3 438.5, y2 275.3
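The b and y values listed for LVSLAEQQLGGYQK can be approximated from the sequence alone. The sketch below assumes singly charged fragments and average (not monoisotopic) residue masses, because that choice reproduces the slide values to within about 0.1; the slides do not state which mass type was used. The function name `fragment_ions` is illustrative.

```python
# Average residue masses (Da) of the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528   # average mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_ions(peptide: str):
    """Return singly charged b and y ion m/z values (b2..b(n-1), y2..y(n-1))."""
    masses = [AVERAGE_RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b ions contain the first i residues (N-terminal fragments).
    b_ions = {f"b{i}": sum(masses[:i]) + PROTON for i in range(2, n)}
    # y ions contain the last i residues plus water (C-terminal fragments).
    y_ions = {f"y{i}": sum(masses[n - i:]) + WATER + PROTON for i in range(2, n)}
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("LVSLAEQQLGGYQK")
print(round(b_ions["b2"], 1), round(b_ions["b13"], 1))  # 213.3 1388.6
print(round(y_ions["y2"], 1), round(y_ions["y13"], 1))  # 275.3 1421.6
```

Consecutive b ions (or y ions) differ by one residue mass, so reading the mass differences along either series spells out the peptide sequence.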
Discriminative Power of DNA and Protein Sequences
Amino Acid Sequence From the MS/MS Spectrum: …..L V S L A E Q Q L G G Y Q K
Amino Acid Sequences of Matching Peptides Found in the Database: …..L V S L A E Q Q L G G F Q K

Database alignment for Small, Acid-Soluble Spore Proteins (query length 14; a dot marks a residue identical to the query):

Accession  Start  LVSLAEQQLGG[ ]YQK   End  Organism
30022721   52     ...........[ ]...   65   Bacillus cereus ATCC 14579
21402693   52     ...........[ ]F..   65   Bacillus anthracis str. A2012
42783829   52     ...........[.]F..   65   Bacillus cereus ATCC 10987
21401004   56     ...........[G]VTR   69   Bacillus anthracis str. A2012
3688809    54     ..QM....M..[ ]...   67   Bacillus firmus
15613701   53     ..AM.......[G]F..   67   Bacillus halodurans
15615764   54     ..AM....M..[ ]F.Q   67   Bacillus halodurans
134232     56     ..Q........[G]RSK   66   Bacillus megaterium
134239     56     ..Q........[G]RF    69   Bacillus megaterium
134223     56     ..Q........[G]RF    69   Bacillus megaterium
21399863   54     ..AM.......[G].TR   64   Bacillus anthracis str. A2012
30020123   56     ..AM.......[G].TR   70   Bacillus cereus ATCC 14579
30021203   56     ..AM.......[ ]RANR  70   Bacillus cereus ATCC 14579
42782199   56     ..AM.......[ ]RANR  70   Bacillus cereus ATCC 10987
21401007   56     ..AM.......[ ]RANR  70   Bacillus anthracis str. A2012
21398813   54     ..AM...S...[ ]FH.   67   Bacillus anthracis str. A2012
134230     57     .....Q.....[G]TSF   70   Thermoactinomyces sacchari

SASP-21402693
DNA Sequences
B. cereus ATCC 14579 ……    GAT CAA AGT GAT CGA CTC GTT GTT AAT CCG CCA ATG GTT TTT
B. anthracis A2012 ……….  GAT CAA AGT GAT CGA CTC GTT GTT AAT CCG CCA AAG GTT TTT
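The two DNA strings differ at a single base, and that difference maps onto the Y/F difference between the peptides. One way to see this: taking the base-by-base complement of each displayed triplet, in the same left-to-right order, gives codons that spell LVSLAEQQLGGYQK for B. cereus and LVSLAEQQLGGFQK for B. anthracis under the standard genetic code. This suggests the slide shows the strand complementary to the coding sequence, which is an inference, not something the slide states. The sketch below only compares the strands and names the two codons that differ.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Triplets exactly as displayed on the slide.
DISPLAYED = {
    "B. cereus ATCC 14579": "GAT CAA AGT GAT CGA CTC GTT GTT AAT CCG CCA ATG GTT TTT",
    "B. anthracis A2012": "GAT CAA AGT GAT CGA CTC GTT GTT AAT CCG CCA AAG GTT TTT",
}
# Standard genetic code entries for the two codons that differ.
CODON_TO_AA = {"TAC": "Y", "TTC": "F"}


def complement_codons(displayed: str) -> list:
    """Complement each displayed triplet base by base, keeping the order."""
    return [codon.translate(COMPLEMENT) for codon in displayed.split()]


cereus = complement_codons(DISPLAYED["B. cereus ATCC 14579"])
anthracis = complement_codons(DISPLAYED["B. anthracis A2012"])

differences = [
    (position, c, a)
    for position, (c, a) in enumerate(zip(cereus, anthracis), start=1)
    if c != a
]
for position, c, a in differences:
    print(position, c, CODON_TO_AA[c], a, CODON_TO_AA[a])  # 12 TAC Y TTC F
```

Codon 12 is the only one that differs, and residue 12 of the peptide is the position where the MS/MS-derived sequence has Y and the B. anthracis database sequence has F.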
Selected Peptide Sequences Discriminating Between an Unknown and Database Bacteria

Peptide #  Sequence              Protein
8          IEDALNSTR             60 kDa chaperonin GROEL
12         DVNEHTLEEEELPVNIEAYK  Hypothetical protein BC0479
19         ESYAAVQADTASK         Putative transcriptional regulator
29         GDTLTAVDNDLSAWFWDEK   Spore coat-associated protein N
45         KQPNFDDSSNFAK         Hypothetical protein BA 3347 [Bacillus anthracis Ames]
47         TGDAALGSISNILLR       Flagellin
59         TNTPMLLQVLEDEVFK      Propionyl-CoA carboxylase biotin-containing subunit
73         STDTLGAQILGNTMEGLYR   Oligopeptide-binding protein oppA
93         EYEVPITAAQADQIVLLMK   IG hypothetical 17696
107        LVSLAEQQLGGYQK        Small acid-soluble spore protein
119        VGSPQPGDLVFFQGTYK     N-acetylmuramoyl-L-alanine amidase
Genes Encoding Identified Proteins
[Figure: Protein length (number of codons; axis from -600 to 600) plotted against Location on the chromosome of Bacillus cereus ATCC 14579 (Mbp; 0 to 5). Panel labels: (+) DNA Strand, (-) DNA Strand.]
CONCLUSIONS
The results demonstrate that a mass spectrometry-based proteomics approach allows for:
• High-confidence classification and identification of bacteria based on genome-traceable proteomic similarities and differences between an analyzed microorganism and reference bacteria;
• Identification of pure cultures as well as mixtures of microorganisms.
Thank you!