Post on 31-Dec-2015
1
Mass Spectrometry-based Proteomics
Xuehua Shen
(Adapted from slides with textbook)
2
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation of mass spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence tags)
3
Motivation
• Proteins are the working units of cells
– The number of known genes is much smaller than the number of expressed proteins
– Directly related to cell processes and diseases
[Figure: DNA (~30,000 human genes) → mRNA (>100,000 RNA messages, through alternative splicing) → Protein (>1,000,000 distinct protein forms, through post-translational modification); SNP is also labeled]
4
Tools for Proteomics
• Edman degradation reaction
• NMR (Nuclear Magnetic Resonance)
• X-ray crystallography
• Protein array
• Mass Spectrometry
5
Mass Spectrometry-based Proteomics
• Primary sequence (sequencing, identification)
• Post-translational modification (PTM) (characterization)
• Quantitative proteomics (quantification)
• Protein-protein interaction
6
7
Components of Mass Spectrometer
• Ion source (ESI and MALDI)
• Mass analyzer (ion traps, TOF, Quadrupole, FT, etc.)
– Mass-to-charge ratio (m/z)
• Ion detector
8
Peptide and Intact Protein
• Peptide: a fragment of a protein
• Some enzymes, e.g. trypsin, break proteins into peptides.
• Some technologies introduce intact proteins into the mass spectrometer.
9
Peptide Fragmentation
• Peptides tend to fragment along the backbone.
• Fragments can also lose neutral chemical groups like NH3 and H2O.
[Figure: peptide backbone H...-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…OH, drawn from N-terminus to C-terminus, with a proton (H+); fragmentation by collision-induced dissociation]
10
Ideal Mass Spectrum
11
Real Mass Spectrum
12
N- and C-terminal Peptides
[Figure: N-terminal peptides and C-terminal peptides]
13
Terminal peptides and ion types
Peptide mass (Da): 57 + 97 + 147 + 114 = 415
Peptide mass (Da) without H2O: 57 + 97 + 147 + 114 – 18 = 397
14
N- and C-terminal Peptides
[Figure: N-terminal peptides and C-terminal peptides, labeled with the masses 415, 486, 301, 154, 57, 71, 185, 332, 429]
17
N- and C-terminal Peptides
[Figure: fragment masses 415, 486, 301, 154, 57, 71, 185, 332, 429]
Problem:
Reconstruct the peptide from the set of masses of its fragments.
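The masses listed on these slides are consistent with a five-residue peptide whose integer residue masses are 57, 97, 147, 114 and 71 Da. The first four are the terms summed on slide 13; the fifth (71) is inferred from the listed fragment masses and is an assumption. A minimal sketch of how the N-terminal (prefix) and C-terminal (suffix) masses arise from such a peptide:

```python
from itertools import accumulate

# Integer residue masses in Da. The first four come from the sum on slide 13;
# the fifth (71) is inferred from the fragment masses and is an assumption.
RESIDUE_MASSES = [57, 97, 147, 114, 71]


def prefix_masses(residues):
    """Masses of N-terminal peptides: running sums from the left."""
    return list(accumulate(residues))


def suffix_masses(residues):
    """Masses of C-terminal peptides: running sums from the right."""
    return list(accumulate(reversed(residues)))


prefixes = prefix_masses(RESIDUE_MASSES)
suffixes = suffix_masses(RESIDUE_MASSES)
print(prefixes)  # [57, 154, 301, 415, 486]
print(suffixes)  # [71, 185, 332, 429, 486]
```

The sequencing problem is the inverse: given only the unordered set of these masses (plus noise, minus missing peaks), recover the order of the residues.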
18
Mass Spectra
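[Figure: spectrum of peptide G V D L K on a mass axis starting at 0; peak spacings labeled 57 Da = 'G' and 99 Da = 'V'; a neutral-loss peak annotated with H2O]
• The peaks in the mass spectrum:
– Prefix and suffix fragments
– Fragments with neutral losses (-H2O, -NH3)
– Noise and missing peaks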
19
Protein Identification with MS/MS
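[Figure: peptide G V D L K is turned by MS/MS into a spectrum (intensity vs. mass, starting at 0); peptide identification goes from the spectrum back to the peptide]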
20
Protein Identification by Tandem Mass Spectrometry
[Figure: Sequence → MS/MS instrument → MS/MS spectrum (relative abundance vs. m/z, 200 to 2000; scan header: S#: 1708 RT: 54.47 AV: 1 NL: 5.27E6 T: + c d Full ms2 638.00 [165.00 - 1925.00])]
• De novo interpretation: Sherenga
• Database search: Sequest
21
De Novo vs. Database Search
[Figure: one MS/MS spectrum (relative abundance vs. m/z) interpreted two ways]
De Novo: the sequence AVGELTK is read directly from the peaks, choosing among candidate residues at each step.
Database Search: the spectrum is matched, by mass and score, against a database of known peptides.
Database of known peptides:
MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT,
HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE,
ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC,
GVFGSVLRA, EKLNKAATYIN..
22
Pros and Cons of de novo Sequencing
• Advantages:
– Gets the sequences that are not necessarily in the database.
– An additional similarity search step using these sequences may identify the related proteins in the database.
• Disadvantages:
– Requires higher-quality data.
– Often contains errors.
23
Current Status
• Protein sequencing is still an open problem, whether it is approached by de novo sequencing or by database search
• The following algorithms deal only with simplified (or ideal) spectra
• Some algorithms combine de novo sequencing and database search
24
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation of mass spectrometry
• De novo sequencing
• Database search
• Algorithms of real software (e.g., sequence tags)
25
De novo Peptide Sequencing
[Figure: MS/MS spectrum (relative abundance vs. m/z) → Sequence]
26
Peptide Sequencing Problem
Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.
Input:
– S: experimental spectrum
– Δ: set of possible ion types
– m: parent mass
Output:
– P: peptide with mass m whose theoretical spectrum best matches the experimental spectrum S
27
Procedure of De Novo Sequencing
• Build spectrum graph
– How to create vertices (from masses)
– How to create edges (from mass differences)
• Find best path or rank paths of spectrum graph
– How to find candidate paths
– How to score paths
28
From Sequence to Spectrum
[Figure: peptide S E Q U E N C E with its b-ion peaks; x-axis: Mass/Charge (m/z)]
29
From Sequence to Spectrum (cont.)
[Figure: peptide S E Q U E N C E with its a-ion peaks; x-axis: Mass/Charge (m/z)]
30
From Sequence to Spectrum (cont.)
a is an ion-type shift of b
[Figure: peptide S E Q U E N C E with a-ion and b-ion peaks; x-axis: Mass/Charge (m/z)]
31
From Sequence to Spectrum (cont.)
[Figure: y-ion peaks, read from the C-terminus as E C N E U Q E S; x-axis: Mass/Charge (m/z)]
32
From Sequence to Spectrum (cont.)
[Figure: combined peaks; axes: Intensity vs. Mass/Charge (m/z)]
34
From Sequence to Spectrum (cont.)
[Figure: spectrum with noise peaks added; x-axis: Mass/Charge (m/z)]
35
MS/MS Spectrum
[Figure: resulting MS/MS spectrum; axes: Intensity vs. Mass/Charge (m/z)]
36
Some Mass Differences between Peaks Correspond to Amino Acids
[Figure: MS/MS spectrum in which the mass differences between pairs of peaks are labeled with the residues s, e, q, u, n and c]
37
Now decoding from spectrum to sequence…?
Build spectrum graph
38
Vertices of Spectrum Graph
• Vertices are generated by reverse shifts corresponding to ion types Δ={δ1, δ2,…, δk}
• Every mass s in an MS/MS spectrum generates k vertices
V(s) = {s+δ1, s+δ2, …, s+δk}
corresponding to potential N-terminal peptides
• Vertices of the spectrum graph:
{initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
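A minimal sketch of vertex generation, assuming an idealized integer-mass model. The offsets used here (0 for an unshifted N-terminal ion, +18 to undo a water loss, +17 to undo an ammonia loss) and the three example peaks are illustrative assumptions, not values taken from the slides:

```python
def spectrum_graph_vertices(peaks, offsets, parent_mass):
    """Apply every reverse shift to every peak; keep the initial and terminal vertices."""
    vertices = {0, parent_mass}  # initial vertex (mass 0) and terminal vertex
    for s in peaks:
        for delta in offsets:
            v = s + delta
            # Only masses strictly between the two end vertices can be prefixes.
            if 0 < v < parent_mass:
                vertices.add(v)
    return sorted(vertices)


# Hypothetical reverse shifts: unshifted ion, undo -H2O, undo -NH3.
OFFSETS = [0, 18, 17]
peaks = [57, 136, 154]  # hypothetical peaks; 136 could be 154 after a water loss
vertices = spectrum_graph_vertices(peaks, OFFSETS, parent_mass=486)
print(vertices)  # [0, 57, 74, 75, 136, 153, 154, 171, 172, 486]
```

Note that the peaks 136 and 154 both vote for the vertex 154, which is why the set has fewer than 3 x 3 + 2 entries. Most generated vertices are wrong guesses; the edges and the scoring step are what separate real prefixes from the rest.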
39
Reverse Shifts
Shift in H2O+NH3
Shift in H2O
40
Edges of Spectrum Graph
• Two vertices with mass difference corresponding to
an amino acid A:
– Connect with an edge labeled by A
• Gap edges for di- and tri-peptides
– Potential sequence tag method (covered later)
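The edge rule can be sketched as follows, again with integer masses. The small residue-mass table is a subset chosen for illustration (a real table covers all 20 amino acids, and residues with equal mass such as I and L cannot be told apart this way); gap edges for di- and tri-peptides are left out:

```python
# Nominal residue masses (Da) for a few amino acids; an illustrative subset.
AMINO_ACID_MASS = {"G": 57, "A": 71, "P": 97, "N": 114, "F": 147}


def spectrum_graph_edges(vertices, aa_mass=AMINO_ACID_MASS):
    """Connect u -> v with label A whenever v - u equals the mass of amino acid A."""
    mass_to_aa = {m: a for a, m in aa_mass.items()}
    ordered = sorted(vertices)
    edges = []
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            label = mass_to_aa.get(v - u)
            if label is not None:
                edges.append((u, v, label))
    return edges


vertices = [0, 57, 154, 301, 415, 486]
edges = spectrum_graph_edges(vertices)
print(edges)
# [(0, 57, 'G'), (57, 154, 'P'), (154, 301, 'F'), (301, 415, 'N'), (415, 486, 'A')]
```

Because every edge goes from a smaller mass to a larger one, the resulting graph is a DAG.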
41
Best Path of Spectrum Graph
• How to find candidate paths
• There are many paths, how to find the correct one?
• We need scoring to evaluate paths
42
Find Candidate Paths
• Heuristic: find a path with the maximum number of edges
• Longest path problem in a DAG
• DFS (Depth First Search)
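The slide names DFS; the sketch below swaps in a different but standard technique, dynamic programming over the vertices in increasing mass order, which is a valid topological order because every edge points from lower to higher mass. The example graph, including the dead-end edge and the gap-style edge labeled "[AN]", is hypothetical:

```python
def longest_path(vertices, edges, source, sink):
    """Return the edge labels of a maximum-edge path from source to sink, or None."""
    order = sorted(vertices)  # increasing mass is a topological order here
    incoming = {v: [] for v in order}
    for u, v, label in edges:
        incoming[v].append((u, label))

    # best[v] = (number of edges on the best path source -> v, (predecessor, label))
    best = {v: (float("-inf"), None) for v in order}
    best[source] = (0, None)
    for v in order:
        for u, label in incoming[v]:
            candidate = best[u][0] + 1
            if candidate > best[v][0]:
                best[v] = (candidate, (u, label))

    if best[sink][0] == float("-inf"):
        return None  # sink is not reachable from source

    # Walk the predecessor links back from the sink.
    labels = []
    v = sink
    while best[v][1] is not None:
        u, label = best[v][1]
        labels.append(label)
        v = u
    return "".join(reversed(labels))


vertices = [0, 57, 128, 154, 301, 415, 486]
edges = [
    (0, 57, "G"), (57, 154, "P"), (154, 301, "F"), (301, 415, "N"), (415, 486, "A"),
    (0, 128, "K"),        # dead end: nothing continues from 128
    (301, 486, "[AN]"),   # gap-style shortcut that skips vertex 415
]
path = longest_path(vertices, edges, source=0, sink=486)
print(path)  # GPFNA
```

Counting edges is only a heuristic, as the slide says: the shortcut path here has four edges and loses to the five-edge path, but a real spectrum needs a proper score to choose among paths.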
43
Path Score
• p(P,S) = probability that peptide P produces spectrum S= {s1,s2,…sq}
• p(P, s) = the probability that peptide P generates a peak s
• Scoring = computing probabilities
44
Finding Optimal Paths in the Spectrum Graph
• For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:
$$p(P', S) = \max_{P} p(P, S)$$
• Peptides = paths in the spectrum graph
• P’ = the optimal path in the spectrum graph
• Some software ranks paths
45
Ions and Probabilities
• A peptide has all k peaks with probability $\prod_{i=1}^{k} q_i$
• and no peaks with probability $\prod_{i=1}^{k} (1 - q_i)$
• A peptide also produces "random noise" with uniform probability qR in any position.
46
Ratio Test Scoring for Partial Peptides
• Incorporates premiums for observed ions and penalties for missing ions.
• Example: for k=4, assume that for a partial peptide P’ we only see ions δ1, δ2, δ4.
The score is calculated as:
$$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{1 - q_3}{1 - q_R} \cdot \frac{q_4}{q_R}$$
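A small sketch of this ratio test for one partial peptide. The probabilities q and qR below are made-up numbers chosen only to show the arithmetic; in practice they would be learned from annotated spectra:

```python
def ratio_score(observed, q, q_r):
    """Premium q_i/q_R for each observed ion, penalty (1-q_i)/(1-q_R) for each missing ion."""
    score = 1.0
    for seen, q_i in zip(observed, q):
        if seen:
            score *= q_i / q_r
        else:
            score *= (1 - q_i) / (1 - q_r)
    return score


# Hypothetical ion probabilities for k = 4 ion types and a hypothetical noise level.
q = [0.7, 0.5, 0.3, 0.2]
q_r = 0.05
# The slide's example: ions 1, 2 and 4 are seen, ion 3 is missing.
observed = [True, True, False, True]
score = ratio_score(observed, q, q_r)
print(score)
```

A score above 1 means the observed pattern is better explained by the partial peptide than by random noise; a score below 1 means the opposite.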
47
Why Not Sequence De Novo?
• De novo sequencing is still not very accurate!
• Less than 30% of the peptides sequenced were completely correct!
Algorithm                              Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)    0.566                 0.189
SHERENGA (Dancik et al., 1999)         0.690                 0.289
Peaks (Ma et al., 2003)                0.673                 0.246
PepNovo (Frank and Pevzner, 2005)      0.727                 0.296
48
Thank you !
The End
50
De Novo vs. Database Search: A Paradox
• De novo algorithms are much faster, even though their search space is much larger!
• A database search scans all peptides in the search space to find the best one.
• De novo eliminates the need to scan all peptides by modeling the problem as a graph search.
Why not sequence de novo?
51
Outline
• Motivation of proteomics
• Mass spectrometry-based proteomics
• Instrumentation: Mass Spectrometry
• De novo sequencing algorithm
• Database search
• Algorithms of real software (e.g., sequence tags)
52
Peptide Identification Problem
Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.
Input:
– S: experimental spectrum
– database of peptides
– Δ: set of possible ion types
– m: parent mass
Output:
– A peptide of mass m from the database whose theoretical spectrum best matches the experimental spectrum S
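A toy version of this problem, assuming integer residue masses, a theoretical spectrum made only of prefix and suffix masses, and the shared peaks count as the match score. The database entries GPFNA and GPFAN and the experimental peak list are hypothetical; this is not how SEQUEST or any other real tool scores candidates:

```python
from itertools import accumulate

# Nominal residue masses (Da); an illustrative subset of the 20 amino acids.
AA = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "L": 113,
      "N": 114, "D": 115, "K": 128, "E": 129, "F": 147}


def theoretical_spectrum(peptide):
    """Prefix and suffix masses of the peptide (idealized: no neutral losses, no noise)."""
    masses = [AA[a] for a in peptide]
    return set(accumulate(masses)) | set(accumulate(reversed(masses)))


def identify(spectrum, database, parent_mass):
    """Return (peptide, score) for the database peptide of mass m with the most shared peaks."""
    peaks = set(spectrum)
    best, best_score = None, -1
    for peptide in database:
        if sum(AA[a] for a in peptide) != parent_mass:
            continue  # parent-mass filter
        score = len(theoretical_spectrum(peptide) & peaks)
        if score > best_score:
            best, best_score = peptide, score
    return best, best_score


database = ["GPFAN", "AVGELTK", "GVDLK", "GPFNA"]
# Hypothetical experimental spectrum: two true peaks missing, one noise peak (250).
spectrum = [57, 71, 154, 185, 250, 301, 332, 486]
result = identify(spectrum, database, parent_mass=486)
print(result)  # ('GPFNA', 7)
```

Here GPFAN has the same parent mass and shares six peaks, so the two candidates differ by a single peak. That thin margin is typical and is why real scoring functions are more elaborate.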
53
MS/MS Database Search
Database search in mass-spectrometry has been very successful in identification of already known proteins.
Experimental spectrum can be compared with theoretical spectra of database peptides to find the best fit.
SEQUEST (Yates et al., 1995)
But reliable identification of modified peptides is a much more difficult problem.
54
Post-Translational Modifications
Proteins are involved in cellular signaling and metabolic regulation.
They are subject to a large number of biological modifications.
Almost all protein sequences are post-translationally modified and 200 types of modifications of amino acid residues are known.
55
Examples of Post-Translational Modification
Post-translational modifications increase the number of “letters” in amino acid alphabet and lead to a combinatorial explosion in both database search and de novo approaches.
56
Search for Modified Peptides: Virtual Database Approach
Yates et al.,1995: an exhaustive search in a virtual database of all modified peptides.
Exhaustive search leads to a large combinatorial problem, even for a small set of modification types.
Problem (Yates et al.,1995). Extend the virtual database approach to a large set of modifications.
57
Exhaustive Search for Modified Peptides
• YFDSTDYNMAK
• 2^5 = 32 possibilities, with 2 types of modifications (phosphorylation? oxidation?)
• For each peptide, generate all modifications.
• Score each modification.
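The count of 32 can be reproduced with a short enumeration. The sketch assumes that phosphorylation may occur on S, T or Y (nominal shift +80 Da) and oxidation on M (nominal shift +16 Da); the slide does not state which residues it counts, but this assumption gives five candidate sites in YFDSTDYNMAK and therefore 2^5 variants:

```python
from itertools import product

# Assumed modifiable residues and nominal mass shifts (Da):
# phosphorylation on S/T/Y, oxidation on M.
MODS = {"S": 80, "T": 80, "Y": 80, "M": 16}


def modified_variants(peptide, mods=MODS):
    """Return one dict per variant, mapping modified positions to their mass shift."""
    sites = [i for i, a in enumerate(peptide) if a in mods]
    variants = []
    # Each candidate site is independently modified or not: 2 ** len(sites) variants.
    for choice in product([False, True], repeat=len(sites)):
        variants.append({i: mods[peptide[i]] for i, on in zip(sites, choice) if on})
    return variants


variants = modified_variants("YFDSTDYNMAK")
print(len(variants))  # 32
```

Each variant would then get its own theoretical spectrum and score, which is why the approach scales badly: every extra modifiable site doubles the work for that peptide.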
58
Modified Peptide Identification Problem
Goal: Find a modified peptide from the database with maximal match between an experimental and theoretical spectrum.
Input:
– S: experimental spectrum
– database of peptides
– Δ: set of possible ion types
– m: parent mass
– Parameter k (# of mutations/modifications)
Output:
– A peptide of mass m that is at most k mutations/modifications apart from a database peptide and whose theoretical spectrum best matches the experimental spectrum S
59
Peptide Identification Problem: Challenge
Very similar peptides may have very different spectra!
Goal: Define a notion of spectral similarity that correlates well with the sequence similarity.
If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.
60
Spectrum Alignment
• See 8.14 and 8.15 in the textbook for one algorithm
• Complicated for real spectra
61
Quality Measure of Mass Spectrometer
• Sensitivity
• Mass accuracy
• Resolution
• Dynamic range
62
Ion Types
• Some masses correspond to fragment
ions, others are just random noise
• Knowing ion types Δ={δ1, δ2,…, δk} lets us
distinguish fragment ions from noise
• We can learn ion types δi and their
probabilities qi by analyzing a large test
sample of annotated spectra.
65
Database Search: Sequence Analysis vs. MS/MS Analysis
Sequence analysis:
similar peptides (that are a few mutations apart) have similar sequences
MS/MS analysis:
similar peptides (that are a few mutations apart) have dissimilar spectra
66
Deficiency of the Shared Peaks Count
Shared peaks count (SPC): intuitive measure of spectral similarity.
Problem: SPC diminishes very quickly as the number of mutations increases.
Only a small portion of correlations between the spectra of mutated peptides is captured by SPC.
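The drop in SPC can be seen on a toy example. The sketch assumes integer residue masses and idealized spectra made of prefix and suffix masses only; the peptide GPFNA and its two mutants are hypothetical examples:

```python
from itertools import accumulate

# Nominal residue masses (Da); an illustrative subset.
AA = {"G": 57, "A": 71, "P": 97, "V": 99, "N": 114, "F": 147}


def ideal_spectrum(peptide):
    """Prefix and suffix masses of the peptide, with no noise and no missing peaks."""
    masses = [AA[a] for a in peptide]
    return set(accumulate(masses)) | set(accumulate(reversed(masses)))


def shared_peaks_count(peptide_a, peptide_b):
    """Number of masses that occur in both idealized spectra."""
    return len(ideal_spectrum(peptide_a) & ideal_spectrum(peptide_b))


spc_self = shared_peaks_count("GPFNA", "GPFNA")
spc_one = shared_peaks_count("GPFNA", "GPVNA")  # one mutation: F -> V
spc_two = shared_peaks_count("GPFNA", "APVNA")  # two mutations: G -> A, F -> V
print(spc_self, spc_one, spc_two)  # 9 4 2
```

A single substitution shifts every prefix that contains the mutated residue and every suffix that contains it, so more than half of the peaks stop matching even though four of the five residues are unchanged.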
67
Ions and Probabilities
• Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}
• δi-ions of a partial peptide are produced independently with probabilities qi
68
De Novo vs. Database Search
• The database of all peptides is huge: ≈ O(20^n) for peptides of length n.
• The database of all known peptides is much smaller: ≈ O(10^8).
• However, de novo algorithms can be much faster, even though their search space is much larger!
• A database search scans every peptide in the database of known peptides to find the best one.
• De novo sequencing eliminates the need to scan the database of all peptides by modeling the problem as a graph search.
69
Probabilistic Model
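• For a position t that corresponds to ion type δj of a partial peptide, the probability p(t, P, S) that peptide P produces a peak at position t is:
$$p(t, P, S) = \begin{cases} q_j & \text{if a peak is generated at position } t \\ 1 - q_j & \text{otherwise} \end{cases}$$
• Similarly, the probability that random noise produces a peak at position t is:
$$p_R(t) = \begin{cases} q_R & \text{if a peak is generated at position } t \\ 1 - q_R & \text{otherwise} \end{cases}$$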
70
Probabilistic Score
• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:
$$\frac{p(P, S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{ij}, P, S)}{p_R(t_{ij})}$$
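For a whole peptide the double product is usually easier to handle in log space, where it becomes a double sum. A minimal sketch, assuming the observations are given as one row per partial peptide i and one entry per ion type j; the probabilities and the observation matrix are made-up values:

```python
import math


def log_ratio_score(observed, q, q_r):
    """Sum of log(p(t_ij, P, S) / p_R(t_ij)) over partial peptides i and ion types j."""
    total = 0.0
    for row in observed:                # one row per partial peptide i
        for seen, q_j in zip(row, q):   # one entry per ion type j
            p = q_j if seen else 1 - q_j
            p_r = q_r if seen else 1 - q_r
            total += math.log(p / p_r)
    return total


# Hypothetical values: k = 2 ion types, n = 2 partial peptides.
q = [0.8, 0.4]
q_r = 0.1
observed = [[True, True],
            [True, False]]
log_score = log_ratio_score(observed, q, q_r)
print(log_score)
```

Working with logarithms avoids underflow when n and k are large, and it turns the score into an additive quantity that can be accumulated along a path in the spectrum graph.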
71
Peak Score
• For a position t that represents ion type δj:
$$p(P, s_t) = \begin{cases} q_j & \text{if a peak is generated at } t \\ 1 - q_j & \text{otherwise} \end{cases}$$
72
Peak Score (cont.)
• For a position t that is not associated with an ion type:
$$p_R(P, s_t) = \begin{cases} q_R & \text{if a peak is generated at } t \\ 1 - q_R & \text{otherwise} \end{cases}$$
• qR = the probability of a noisy peak that does not correspond to any ion type