Post on 05-Jan-2016
Proteomics Informatics – Protein identification I: searching protein
sequence collections and significance testing (Week 4)
Peptide Mapping - Mass Accuracy
Peptide Mapping - Database Size
C. elegans
S. cerevisiae
Human
Peptide Mapping - Cys-Containing Peptides
C. elegans
S. cerevisiae
Human
Identification – Peptide Mass Fingerprinting
MS
Digestion
All Peptide Masses
Pick Protein
Compare, Score, Test Significance
Repeat for each protein
Sequence DB
Identified Proteins
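The peptide mass fingerprinting loop above can be sketched in a few lines. This is a minimal illustration, not the ProFound algorithm: the function names (`tryptic_digest`, `peptide_mh`, `rank_proteins`), the toy database, and the simple match-count score are all assumptions made for the example, and no significance test is performed. Residue masses are the monoisotopic values from the amino acid mass table in these slides.

```python
# Monoisotopic residue masses (Da), as listed in the amino acid mass table.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER, PROTON = 18.0106, 1.00728


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mh(peptide):
    """Singly protonated monoisotopic mass (M+H) of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def rank_proteins(observed, database, tol=0.5):
    """Score each protein by the number of observed masses it explains."""
    results = []
    for name, sequence in database.items():          # repeat for each protein
        theoretical = [peptide_mh(p) for p in tryptic_digest(sequence)]
        score = sum(any(abs(m - t) <= tol for t in theoretical) for m in observed)
        results.append((score, name))
    return sorted(results, reverse=True)


# Hypothetical two-protein database and two measured peptide masses.
database = {"P1": "SGFLEEDELKAAR", "P2": "GGGGKLSDPGVSPAVLSLEMLTDR"}
observed = [1166.56, 317.19]
ranking = rank_proteins(observed, database)
print(ranking)
```

A real search engine would replace the raw match count with a probabilistic score and test its significance, as discussed later in these slides.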
ProFound Results
Database size
Mixtures
Peptide Fragmentation
Ion Source
Mass Analyzer 1
Fragmentation
Mass Analyzer 2
Detector
Fragment ion types: b, y
Identification – Tandem MS
[Figure: tandem mass spectrum; x-axis: m/z (0 to 1000), y-axis: % relative abundance (0 to 100)]
Tandem MS – Sequence Confirmation
[Figure: tandem mass spectrum; x-axis: m/z (0 to 1000), y-axis: % relative abundance (0 to 100)]
KLEDEELFGS (drawn right to left: the b-ion series starts at S, the y-ion series starts at K)

Residue  y ion  b ion
K        147    1166
L        260    1020
E        389    907
D        504    778
E        633    663
E        762    534
L        875    405
F        1022   292
G        1080   145
S        1166   88
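The b- and y-ion ladders on this slide can be recomputed from residue masses. The sketch below assumes singly charged fragments and uses the monoisotopic residue masses from the amino acid mass table in these slides; `fragment_ions` is a hypothetical helper name. The slide shows integer masses, so the computed values agree with it only to within about 0.6 Da.

```python
# Monoisotopic residue masses (Da), as listed in the amino acid mass table.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER, PROTON = 18.0106, 1.00728


def fragment_ions(peptide):
    """Return (b ions, y ions) for a peptide written N-terminus to C-terminus."""
    b_ions, y_ions = [], []
    prefix = 0.0
    for aa in peptide[:-1]:                 # b1 .. b(n-1): N-terminal fragments
        prefix += RESIDUE_MASS[aa]
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for aa in reversed(peptide[1:]):        # y1 .. y(n-1): C-terminal fragments
        suffix += RESIDUE_MASS[aa]
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


# The slide draws the peptide right to left, so N-to-C it reads SGFLEEDELK.
b_ions, y_ions = fragment_ions("SGFLEEDELK")
for k, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{k} = {b:8.2f}   y{k} = {y:8.2f}")
```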
Tandem MS – Sequence Confirmation
[Figure: annotated spectrum of KLEDEELFGS; x-axis: m/z (0 to 1000), y-axis: % relative abundance (0 to 100); precursor labeled [M+2H]2+; labeled fragment peaks at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, and 1080]
Tandem MS – Sequence Confirmation
[Figure: annotated spectrum of KLEDEELFGS with the mass differences 113 and 129 marked on the spectrum and on the sequence]
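Tandem MS – de novo Sequencing
[Figure: tandem mass spectrum; x-axis: m/z (0 to 1000), y-axis: % relative abundance (0 to 100); precursor labeled [M+2H]2+; labeled peaks at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, and 1080]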
Mass Differences: Amino acid masses

1-letter code  3-letter code  Chemical formula  Monoisotopic (Da)  Average (Da)
A              Ala            C3H5ON            71.0371            71.0788
R              Arg            C6H12ON4          156.101            156.188
N              Asn            C4H6O2N2          114.043            114.104
D              Asp            C4H5O3N           115.027            115.089
C              Cys            C3H5ONS           103.009            103.139
E              Glu            C5H7O3N           129.043            129.116
Q              Gln            C5H8O2N2          128.059            128.131
G              Gly            C2H3ON            57.0215            57.0519
H              His            C6H7ON3           137.059            137.141
I              Ile            C6H11ON           113.084            113.159
L              Leu            C6H11ON           113.084            113.159
K              Lys            C6H12ON2          128.095            128.174
M              Met            C5H9ONS           131.04             131.193
F              Phe            C9H9ON            147.068            147.177
P              Pro            C5H7ON            97.0528            97.1167
S              Ser            C3H5O2N           87.032             87.0782
T              Thr            C4H7O2N           101.048            101.105
W              Trp            C11H10ON2         186.079            186.213
Y              Tyr            C9H9O2N           163.063            163.176
V              Val            C5H9ON            99.0684            99.1326

Sequences consistent with spectrum
Tandem MS – de novo Sequencing
Mass differences between peaks (row: lower m/z, column: higher m/z)

m/z    292   389   405   504   534   633   663   762   778   875   907  1020  1022  1079
260     32   129   145   244   274   373   403   502   518   615   647   760   762   819
292           97   113   212   242   341   371   470   486   583   615   728   730   787
389                 16   115   145   244   274   373   389   486   518   631   633   690
405                       99   129   228   258   357   373   470   502   615   617   674
504                             30   129   159   258   274   371   403   516   518   575
534                                   99   129   228   244   341   373   486   488   545
633                                         30   129   145   242   274   387   389   446
663                                               99   115   212   244   357   359   416
762                                                     16   113   145   258   260   317
778                                                           97   129   242   244   301
875                                                                 32   145   147   204
907                                                                      113   115   172
1020                                                                             2    59
1022                                                                                  57
Mass differences that match an amino acid residue mass, replaced by the residue

m/z    292   389   405   504   534   633   663   762   778   875   907  1020  1022  1079
260     32     E   145   244   274   373   403   502   518   615   647   760   762   819
292            P   I/L   212   242   341   371   470   486   583   615   728   730   787
389                 16     D   145   244   274   373   389   486   518   631   633   690
405                        V     E   228   258   357   373   470   502   615   617   674
504                             30     E   159   258   274   371   403   516   518   575
534                                    V     E   228   244   341   373   486   488   545
633                                         30     E   145   242   274   387   389   446
663                                                V     D   212   244   357   359   416
762                                                     16   I/L   145   258   260   317
778                                                            P     E   242   244   301
875                                                                 32   145     F   204
907                                                                      I/L     D   172
1020                                                                             2    59
1022                                                                                   G
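The annotated difference matrix can be reproduced with a short script. This sketch assumes integer peak masses as shown on the slide and a matching tolerance of 0.5 Da against the monoisotopic residue masses from the table above; the names `label` and `difference_matrix` are hypothetical.

```python
# Monoisotopic residue masses (Da), as listed in the amino acid mass table.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}

PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778,
         875, 907, 1020, 1022, 1079]


def label(delta, tol=0.5):
    """Return the residue(s) whose mass matches delta, else delta itself."""
    hits = [aa for aa, mass in RESIDUE_MASS.items() if abs(delta - mass) <= tol]
    return "/".join(hits) if hits else delta


def difference_matrix(peaks):
    """Map each (lower peak, higher peak) pair to a residue label or a mass."""
    peaks = sorted(peaks)
    return {(lo, hi): label(hi - lo)
            for i, lo in enumerate(peaks) for hi in peaks[i + 1:]}


matrix = difference_matrix(PEAKS)
for lo in PEAKS[:-1]:
    row = [str(matrix[(lo, hi)]) for hi in PEAKS if hi > lo]
    print(f"{lo:5d}  " + " ".join(f"{cell:>5}" for cell in row))
```

Chains of consecutive residue labels (for example 260 → 389 → 504 → 633) are the candidate sequence tags that the next slide assembles.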
Tandem MS – de novo Sequencing
Sequences consistent with the spectrum:
…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…
Peptide M+H = 1166
1166 – 1079 = 87 => S
SGF(I/L)EEDE(I/L)…
1166 – 1020 – 18 = 128 => K or Q
SGF(I/L)EEDE(I/L)(K/Q)
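The two terminal residues follow from simple arithmetic on the precursor mass. The sketch below repeats that arithmetic under the assumption that 1079 is the largest y ion and 1020 the largest b ion, using a 0.5 Da tolerance; `residues_for`, `first_residue`, and `last_residue` are hypothetical names.

```python
# Monoisotopic residue masses (Da), as listed in the amino acid mass table.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER = 18.0106


def residues_for(delta, tol=0.5):
    """All residues whose monoisotopic mass lies within tol of delta."""
    return [aa for aa, mass in RESIDUE_MASS.items() if abs(delta - mass) <= tol]


precursor_mh = 1166

# M+H minus the largest y ion leaves the mass of the first residue.
first_residue = residues_for(precursor_mh - 1079)

# M+H minus the largest b ion minus water leaves the mass of the last residue.
last_residue = residues_for(precursor_mh - 1020 - WATER)

print(first_residue, last_residue)
```

K and Q differ by only about 0.04 Da, so they cannot be told apart at this mass accuracy, just as I and L cannot be told apart at any mass accuracy.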
Challenges in de novo sequencing
Neutral loss (-H2O, -NH3)
Modifications
Background peaks
Incomplete information
Tandem MS – Database Search
Lysis
Fractionation
Digestion
LC-MS
MS/MS
Sequence DB
Pick Protein
Pick Peptide
All Fragment Masses
Compare, Score, Test Significance
Repeat for all peptides
Repeat for all proteins
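The nested search loop (pick protein, pick peptide, compare fragment masses, score) can be sketched as follows. This is a toy illustration under stated assumptions: tryptic digestion without missed cleavages, singly charged b and y ions, a plain match-count score, and 1 Da tolerances because the example peaks are the integer masses from the earlier slides. The function names and the two-protein database are hypothetical, and no significance test is included.

```python
# Monoisotopic residue masses (Da), as listed in the amino acid mass table.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER, PROTON = 18.0106, 1.00728


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mh(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def fragment_masses(peptide):
    """All singly charged b and y ion masses of a peptide."""
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    masses, prefix = [], 0.0
    for aa in peptide[:-1]:
        prefix += RESIDUE_MASS[aa]
        masses.append(prefix + PROTON)                   # b ion
        masses.append(total - prefix + WATER + PROTON)   # complementary y ion
    return masses


def search(spectrum, precursor_mh, database, precursor_tol=1.0, fragment_tol=1.0):
    """Return (protein, peptide, score) for the best-scoring candidate."""
    best = None
    for name, protein in database.items():               # repeat for all proteins
        for peptide in tryptic_digest(protein):          # repeat for all peptides
            if abs(peptide_mh(peptide) - precursor_mh) > precursor_tol:
                continue
            theoretical = fragment_masses(peptide)
            score = sum(any(abs(m - t) <= fragment_tol for t in theoretical)
                        for m in spectrum)
            if best is None or score > best[2]:
                best = (name, peptide, score)
    return best


spectrum = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778,
            875, 907, 1020, 1022, 1080]
database = {"P1": "SGFLEEDELKAAR", "P2": "SEGLFDEELKAAR"}  # P2: scrambled decoy
best = search(spectrum, 1166, database)
print(best)
```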
Search Results
Significance Testing
False protein identifications are caused by random matching.
An objective criterion for testing the significance of protein identification results is necessary.
The significance of protein identifications can be tested once the distribution of scores for false results is known.
Significance Testing - Expectation Values
The majority of sequences in a collection will give a score due to random matching.
Database Search
M/Z
List of Candidates
Extrapolate and Calculate Expectation Values
List of Candidates With Expectation Values
Distribution of Scores for Random and False Identifications
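One way to turn the score distribution into expectation values is to fit the tail of the survival curve and extrapolate it to the score of interest. The sketch below assumes that the number of random matches scoring at least s falls off exponentially with s, so that its logarithm is linear in s. That log-linear model, the function names, and the synthetic scores are assumptions made for illustration, not details taken from the slides.

```python
import math


def fit_survival(scores, min_count=10):
    """Least-squares fit of log(#scores >= s) = intercept + slope * s.

    Assumes distinct scores; the top (min_count - 1) scores are left out
    of the fit because the far tail of the distribution is noisy.
    """
    ordered = sorted(scores, reverse=True)
    xs, ys = [], []
    for rank, s in enumerate(ordered, start=1):   # rank = number of scores >= s
        if rank >= min_count:
            xs.append(s)
            ys.append(math.log(rank))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expectation_value(score, slope, intercept):
    """Extrapolated number of random matches expected at or above score."""
    return math.exp(intercept + slope * score)


# Synthetic random-match scores: quantiles of an exponential distribution
# with decay constant 0.5 (hypothetical data for 2000 candidate sequences).
N, DECAY = 2000, 0.5
random_scores = [-math.log((k + 0.5) / N) / DECAY for k in range(N)]

slope, intercept = fit_survival(random_scores)
e_top = expectation_value(30.0, slope, intercept)
print(f"slope = {slope:.3f}, e(30) = {e_top:.2e}")
```

A candidate whose extrapolated expectation value is far below 1 is unlikely to be a random match.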
Significance Testing - Expectation Values
Rho-diagrams: Overall Quality of a Data Set
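Expectation values as a function of score for random matching, with constants α and β:
$$e(s) = \beta \exp(-\alpha s)$$
Definition: E_i (i = 0, -1, -2, …) is the number of spectra that have been assigned an expectation value between exp(i) and exp(i-1). For random matching, with N the number of spectra:
$$E_i = N \int_{\exp(i-1)}^{\exp(i)} de = N\{\exp(i) - \exp(i-1)\} = N \exp(i)\{1 - \exp(-1)\}$$
$$\rho_i = \log\left(\frac{E_i}{E_0}\right) = \log\left(\frac{N \exp(i)\{1 - \exp(-1)\}}{N\{1 - \exp(-1)\}}\right) = i$$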
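The counts E_i from the definition above can be tabulated directly from a list of expectation values. The sketch below assumes the rho-diagram plots log(E_i / E_0) against i, and checks that evenly spread expectation values between 0 and 1 (a stand-in for random matching) fall on the diagonal. The function name `rho_diagram` and the synthetic input are hypothetical.

```python
import math
from collections import Counter


def rho_diagram(evalues, i_min=-6):
    """Return {i: log(E_i / E_0)} for i = 0, -1, ..., i_min.

    E_i counts the expectation values e with exp(i-1) < e <= exp(i).
    """
    counts = Counter()
    for e in evalues:
        if e <= 0:
            continue
        i = math.ceil(math.log(e))        # bin index of this expectation value
        if i_min <= i <= 0:
            counts[i] += 1
    if counts[0] == 0:
        raise ValueError("no expectation values in the reference bin i = 0")
    return {i: math.log(counts[i] / counts[0])
            for i in range(0, i_min - 1, -1) if counts[i] > 0}


# Random matching stand-in: expectation values spread evenly over (0, 1).
N = 100_000
uniform_evalues = [(k + 0.5) / N for k in range(N)]
rho = rho_diagram(uniform_evalues, i_min=-5)
for i, value in rho.items():
    print(f"i = {i:3d}   rho = {value:7.3f}")
```

A data set with many true identifications has more low expectation values than random matching predicts, so its points lie above the diagonal at negative i.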
Rho-diagram: Random Matching
[Figure: rho-diagram for random matching; x-axis: log(e); both axes run from -6 to 0]
Rho-diagram: Data Quality
[Figure: rho-diagram; x-axis: log(e); both axes run from -10 to 0]
Rho-diagram: Parameters
How many fragments are sufficient?
To identify an unmodified peptide?
To identify a modified peptide?
To localize a modification on a peptide?
How does it depend on different parameters?
• Precursor mass
• Precursor mass error
• Fragment mass error
• Background peaks
Simulations using synthetic spectra
Select a peptide sequence
Calculate possible fragment ion masses
Choose number of fragment ions to select
Randomly select fragment ions
Search and store result
Average over peptides
Example peptide: LSDPGVSPAVLSLEMLTDR, searched against Seq. DB
b ions: 114.09 201.12 316.15 413.20 470.22 569.29 656.32 753.37 824.41 923.48 1036.56 1123.59 1236.68 1365.72 1496.76 1609.84 1710.89 1825.92
y ions: 175.12 290.15 391.19 504.28 635.32 764.36 877.44 964.48 1077.56 1176.63 1247.67 1344.72 1431.75 1530.82 1587.84 1684.89 1799.92 1886.95
Simulations using synthetic spectra
Number of fragment ions to select (5, 6, 7, 8, or 9 shown): 8
Randomly selected fragment ions of LSDPGVSPAVLSLEMLTDR: 201.12 504.28 964.48 1123.59 1247.67 1496.76 1530.82 1710.89
Search engine with Seq. DB → Identification: LSDPGVSPAVLSLEMLTDR
Is it significant?
Is the identified sequence identical to the one used to generate the synthetic data?
Each point is an average of searches with 20 randomly generated synthetic fragment mass spectra.
Threshold
Each point is an average of 50 peptides.
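The simulation loop described on these slides can be sketched with a toy search engine. Everything beyond the loop itself is an assumption made for illustration: the decoy database consists of shuffled copies of the target peptide (so every candidate has the same precursor mass), the score is a plain count of matched fragments, and a spectrum counts as identified only when the true peptide scores strictly higher than every decoy. The slides instead test significance with a real search engine.

```python
import random

# Monoisotopic residue masses (Da), as listed in the amino acid mass table.
RESIDUE_MASS = {
    "A": 71.0371, "R": 156.101, "N": 114.043, "D": 115.027, "C": 103.009,
    "E": 129.043, "Q": 128.059, "G": 57.0215, "H": 137.059, "I": 113.084,
    "L": 113.084, "K": 128.095, "M": 131.04, "F": 147.068, "P": 97.0528,
    "S": 87.032, "T": 101.048, "W": 186.079, "Y": 163.063, "V": 99.0684,
}
WATER, PROTON = 18.0106, 1.00728


def fragment_masses(peptide):
    """All singly charged b and y ion masses of a peptide."""
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    masses, prefix = [], 0.0
    for aa in peptide[:-1]:
        prefix += RESIDUE_MASS[aa]
        masses.append(prefix + PROTON)                   # b ion
        masses.append(total - prefix + WATER + PROTON)   # complementary y ion
    return masses


def score(selected, peptide, tol):
    """Number of selected fragment masses explained by the peptide."""
    theoretical = fragment_masses(peptide)
    return sum(any(abs(m - t) <= tol for t in theoretical) for m in selected)


def identification_probability(peptide, n_fragments, n_spectra=20,
                               n_decoys=100, tol=0.5, seed=0):
    """Fraction of synthetic spectra for which the true peptide wins."""
    rng = random.Random(seed)
    decoys = []
    while len(decoys) < n_decoys:                        # toy sequence database
        shuffled = "".join(rng.sample(list(peptide), len(peptide)))
        if shuffled != peptide:
            decoys.append(shuffled)
    possible = fragment_masses(peptide)
    identified = 0
    for _ in range(n_spectra):
        selected = rng.sample(possible, n_fragments)     # synthetic spectrum
        best_decoy = max(score(selected, d, tol) for d in decoys)
        if score(selected, peptide, tol) > best_decoy:
            identified += 1
    return identified / n_spectra


peptide = "LSDPGVSPAVLSLEMLTDR"
curve = {n: identification_probability(peptide, n) for n in (0, 1, 4, 12)}
print(curve)
```

Averaging such curves over many peptides, and repeating with different mass tolerances or added background peaks, gives plots of the kind shown on the following slides.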
Critical number of fragment masses
Small peptides are slightly more difficult to identify
[Figure: probability of identification vs. number of fragment ions (0 to 20); curves for precursor masses (m_precursor) of 1000 Da, 1500 Da, 2000 Da, and 2500 Da]
Conditions: Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification
A lower precursor mass error requires fewer fragment masses for identification of unmodified peptides
[Figure: probability of identification vs. number of fragment ions (0 to 20); curves for precursor mass errors of 0.01 Da, 1 Da, and 10 Da]
Conditions: m_precursor = 2000 Da, Δm_fragment = 0.5 Da, no modification
The dependence on the fragment mass error is weak below a threshold for identification of unmodified peptides
[Figure: probability of identification vs. number of fragment ions (0 to 20); curves for fragment mass errors (Δm_fragment) of 0.01 Da, 0.5 Da, 1 Da, and 2 Da]
Conditions: m_precursor = 2000 Da, Δm_precursor = 1 Da, no modification
A moderate number of background peaks can be tolerated when identifying unmodified peptides
[Figure: probability of identification vs. number of fragment ions (0 to 20); curves for 0%, 50%, and 80% background]
Conditions: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification
A large number of background peaks can be tolerated if the fragment mass is accurate
[Figure: probability of identification vs. number of fragment ions (0 to 20); curves for 0%, 50%, and 80% background]
Conditions: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.01 Da, no modification
Identification of phosphopeptides is only slightly more difficult
[Figure: probability of identification vs. number of fragment ions (0 to 20); curves for phosphorylated and unmodified peptides]
Conditions: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da