Proteomics Informatics –

Post on 05-Jan-2016


Proteomics Informatics – Protein identification I: searching protein sequence collections and significance testing (Week 4)

Peptide Mapping - Mass Accuracy

Peptide Mapping - Database Size

[Figure panels: C. elegans, S. cerevisiae, Human]

Peptide Mapping - Cys-Containing Peptides

[Figure panels: C. elegans, S. cerevisiae, Human]

Identification – Peptide Mass Fingerprinting

[Workflow diagram: Digestion → MS (experimental peptide masses); Sequence DB → Pick Protein → All Peptide Masses (calculated); Compare, Score, Test Significance; repeat for each protein; output: Identified Proteins]
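The fingerprinting loop in the workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by ProFound or any other tool: the cleavage rule (after K or R, not before P, no missed cleavages), the match-count score, the 0.5 Da tolerance, and the two-entry database are all assumptions made for the example, and the significance test is left out.

```python
# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O added to the residue sum of a free peptide


def tryptic_peptides(protein, min_len=5):
    """Cleave after K or R unless the next residue is P; no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            if i + 1 - start >= min_len:
                peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def count_matches(observed, protein, tol=0.5):
    """Number of observed masses explained by a calculated peptide mass."""
    calculated = [peptide_mass(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(m - c) <= tol for c in calculated) for m in observed)


def rank_proteins(observed, database, tol=0.5):
    """Score every protein in the database and sort best first."""
    scores = {name: count_matches(observed, seq, tol) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical database; protein_a embeds two peptides that appear in these slides.
database = {
    "protein_a": "MKSGFLEEDELKLSDPGVSPAVLSLEMLTDRAAK",
    "protein_b": "MTAYHWNQCVRGGPFIK",
}
observed = [peptide_mass("SGFLEEDELK"), peptide_mass("LSDPGVSPAVLSLEMLTDR")]
ranked = rank_proteins(observed, database)
```

A real search would replace the match count with a probabilistic score and attach a significance estimate to each candidate.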

ProFound Results

Database size

Mixtures

Peptide Fragmentation

[Instrument diagram: Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector; fragment ion types: b and y]

Identification – Tandem MS

[Figure: tandem mass spectrum; x-axis m/z (0 to 1000), y-axis % Relative Abundance (0 to 100)]

Tandem MS – Sequence Confirmation

Peptide as drawn on the slide: KLEDEELFGS (the y series starts at K, the b series starts at S)

Residue   y ions (m/z)   b ions (m/z)
K         147            1166
L         260            1020
E         389            907
D         504            778
E         633            663
E         762            534
L         875            405
F         1022           292
G         1080           145
S         1166           88

[Figure: MS/MS spectrum; x-axis m/z (0 to 1000), y-axis % Relative Abundance (0 to 100); labeled peaks at 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080 and [M+2H]2+. Successive builds of the slide mark mass differences of 113 and 129 between neighboring peaks.]
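The b and y ion values on the sequence confirmation slides can be reproduced from residue masses. The sketch below assumes the peptide reads SGFLEEDELK from N-terminus to C-terminus (the slide draws it from the lysine end), singly charged fragments, and standard monoisotopic masses; the function names are illustrative.

```python
# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056


def b_ions(peptide):
    """Singly charged b ions b1..b(n-1): N-terminal prefixes plus a proton."""
    masses, total = [], PROTON
    for aa in peptide[:-1]:
        total += MONO[aa]
        masses.append(total)
    return masses


def y_ions(peptide):
    """Singly charged y ions y1..yn: C-terminal suffixes plus water and a proton.

    The last entry, yn, equals the [M+H]+ of the intact peptide.
    """
    masses, total = [], WATER + PROTON
    for aa in reversed(peptide):
        total += MONO[aa]
        masses.append(total)
    return masses


peptide = "SGFLEEDELK"  # assumed N-to-C reading of the slide's KLEDEELFGS
b_series = b_ions(peptide)
y_series = y_ions(peptide)

# Nominal values as listed on the slide (1166 is the intact [M+H]+).
slide_b = [88, 145, 292, 405, 534, 663, 778, 907, 1020]
slide_y = [147, 260, 389, 504, 633, 762, 875, 1022, 1080, 1166]
```

The calculated monoisotopic values agree with the slide's nominal values to within roughly half a dalton, which is the expected offset between exact and nominal masses in this range.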

Tandem MS – de novo Sequencing

[Figure: MS/MS spectrum; x-axis m/z (0 to 1000), y-axis % Relative Abundance (0 to 100); labeled peaks at 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080 and [M+2H]2+]

Mass Differences

Amino acid masses

1-letter code   3-letter code   Chemical formula   Monoisotopic (Da)   Average (Da)
A               Ala             C3H5ON             71.0371             71.0788
R               Arg             C6H12ON4           156.101             156.188
N               Asn             C4H6O2N2           114.043             114.104
D               Asp             C4H5O3N            115.027             115.089
C               Cys             C3H5ONS            103.009             103.139
E               Glu             C5H7O3N            129.043             129.116
Q               Gln             C5H8O2N2           128.059             128.131
G               Gly             C2H3ON             57.0215             57.0519
H               His             C6H7ON3            137.059             137.141
I               Ile             C6H11ON            113.084             113.159
L               Leu             C6H11ON            113.084             113.159
K               Lys             C6H12ON2           128.095             128.174
M               Met             C5H9ONS            131.04              131.193
F               Phe             C9H9ON             147.068             147.177
P               Pro             C5H7ON             97.0528             97.1167
S               Ser             C3H5O2N            87.032              87.0782
T               Thr             C4H7O2N            101.048             101.105
W               Trp             C11H10ON2          186.079             186.213
Y               Tyr             C9H9O2N            163.063             163.176
V               Val             C5H9ON             99.0684             99.1326

Sequences consistent with spectrum

Tandem MS – de novo Sequencing

Peaks (m/z): 260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079

Mass differences (each row lists the differences between that peak and every higher peak, in order):

260: 32 129 145 244 274 373 403 502 518 615 647 760 762 819
292: 97 113 212 242 341 371 470 486 583 615 728 730 787
389: 16 115 145 244 274 373 389 486 518 631 633 690
405: 99 129 228 258 357 373 470 502 615 617 674
504: 30 129 159 258 274 371 403 516 518 575
534: 99 129 228 244 341 373 486 488 545
633: 30 129 145 242 274 387 389 446
663: 99 115 212 244 357 359 416
762: 16 113 145 258 260 317
778: 97 129 242 244 301
875: 32 145 147 204
907: 113 115 172
1020: 2 59
1022: 57

Peaks (m/z): 260 292 389 405 504 534 633 663 762 778 875 907 1020 1022 1079

Mass differences with residue assignments (a difference that matches a residue mass is shown as its one-letter code):

260: 32 E 145 244 274 373 403 502 518 615 647 760 762 819
292: P I/L 212 242 341 371 470 486 583 615 728 730 787
389: 16 D 145 244 274 373 389 486 518 631 633 690
405: V E 228 258 357 373 470 502 615 617 674
504: 30 E 159 258 274 371 403 516 518 575
534: V E 228 244 341 373 486 488 545
633: 30 E 145 242 274 387 389 446
663: V D 212 244 357 359 416
762: 16 I/L 145 258 260 317
778: P E 242 244 301
875: 32 145 F 204
907: I/L D 172
1020: 2 59
1022: G
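The difference matrix and its residue assignments can be generated mechanically. The following sketch takes the peak list from the slide, forms every pairwise difference, and keeps the pairs whose difference lies within a tolerance of a monoisotopic residue mass. The 0.5 Da tolerance and the function names are assumptions for illustration; I and L are reported together because they have the same mass.

```python
# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Peak list from the de novo sequencing slide.
PEAKS = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1079]


def residues_for(delta, tol=0.5):
    """One-letter codes of all residues whose mass is within tol of delta."""
    return sorted(aa for aa, mass in MONO.items() if abs(mass - delta) <= tol)


def annotate(peaks, tol=0.5):
    """Map (low, high) peak pairs to the residue(s) explaining their difference."""
    hits = {}
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            names = residues_for(high - low, tol)
            if names:
                hits[(low, high)] = "/".join(names)
    return hits


assignments = annotate(PEAKS)
```

Chaining assignments that share a peak, for example (260, 389), (389, 504), (504, 633), gives the partial sequences read off in the next slides; because the peaks belong to two interleaved ion series, two ladders emerge.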

Tandem MS – de novo Sequencing

Partial sequences read from the mass differences:

…GF(I/L)EEDE(I/L)…

…(I/L)EDEE(I/L)FG…

Peptide M+H = 1166

1166 - 1079 = 87 => S

SGF(I/L)EEDE(I/L)…

1166 - 1020 - 18 = 128 => K or Q

SGF(I/L)EEDE(I/L)(K/Q)
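The two terminal residues follow from simple arithmetic on the intact mass. As a small check, the sketch below repeats the subtractions from the slide and looks the results up in a residue mass table; the 0.5 Da tolerance and the helper name are assumptions.

```python
# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def candidates(mass, tol=0.5):
    """Residues whose monoisotopic mass is within tol of the given mass."""
    return sorted(aa for aa, m in MONO.items() if abs(m - mass) <= tol)


m_plus_h = 1166

# Intact [M+H]+ minus the largest y ion leaves the N-terminal residue.
n_terminal = candidates(m_plus_h - 1079)

# Intact [M+H]+ minus the largest b ion minus water leaves the C-terminal residue.
c_terminal = candidates(m_plus_h - 1020 - 18)
```

K and Q differ by only about 0.036 Da, so nominal-mass data cannot separate them; a tryptic digest would favor K at the C-terminus.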

Challenges in de novo sequencing

Neutral loss (-H2O, -NH3)

Modifications

Background peaks

Incomplete information


Tandem MS – Database Search

[Workflow diagram: Lysis → Fractionation → Digestion → LC-MS → MS/MS (experimental fragment masses); Sequence DB → Pick Protein → Pick Peptide → All Fragment Masses (calculated); Compare, Score, Test Significance; repeat for all peptides; repeat for all proteins]
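The inner loop of the database search can be sketched as follows: candidate peptides are first filtered by precursor mass, then each survivor is scored by how many observed fragments it explains. This is a simplified stand-in for a real search engine. The shared-peak count, the tolerances, and the three-peptide candidate list are assumptions, and protein digestion and significance testing are omitted.

```python
# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056


def fragment_masses(peptide):
    """Singly charged b1..b(n-1) and y1..y(n-1) ions of a peptide."""
    masses, b, y = [], PROTON, WATER + PROTON
    for k in range(1, len(peptide)):
        b += MONO[peptide[k - 1]]   # extend the N-terminal prefix
        y += MONO[peptide[-k]]      # extend the C-terminal suffix
        masses.extend([b, y])
    return masses


def search(precursor_mh, fragments, peptides, prec_tol=1.0, frag_tol=0.5):
    """Return (score, peptide) pairs for peptides matching the precursor, best first."""
    results = []
    for pep in peptides:
        mh = sum(MONO[aa] for aa in pep) + WATER + PROTON
        if abs(mh - precursor_mh) > prec_tol:
            continue  # precursor mass filter
        calculated = fragment_masses(pep)
        score = sum(any(abs(f - c) <= frag_tol for c in calculated) for f in fragments)
        results.append((score, pep))
    return sorted(results, reverse=True)


# Peak list and precursor from the sequence confirmation slides.
observed = [260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080]
# Hypothetical candidates: the true peptide, its reverse (same mass), and an unrelated one.
candidates = ["SGFLEEDELK", "KLEDEELFGS", "LSDPGVSPAVLSLEMLTDR"]
results = search(1166, observed, candidates)
```

The reversed sequence passes the precursor filter because it has the same mass, but it explains none of the fragments, which is why fragment information is so much more discriminating than the precursor mass alone.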

Search Results

Significance Testing

False protein identification is caused by random matching

An objective criterion for testing the significance of protein identification results is necessary.

The significance of protein identifications can be tested once the distribution of scores for false results is known.

Significance Testing - Expectation Values

The majority of sequences in a collection will give a score due to random matching.

[Workflow diagram: M/Z → Database Search → List of Candidates → Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values → List of Candidates With Expectation Values]
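One way to turn a score distribution into expectation values is sketched below. It assumes that the high-score tail of the random-match distribution is exponential, so the logarithm of the survival count (number of candidates scoring at least s) is linear in s; a least-squares line through those points is then extrapolated to the score of interest. The slides do not spell out the fitting procedure, so this is an illustrative assumption, and the function names and synthetic scores are hypothetical.

```python
import math


def survival_points(scores):
    """(score, ln(number of candidates with score >= that score)) for each distinct score."""
    points = []
    for s in sorted(set(scores)):
        n_at_least = sum(1 for x in scores if x >= s)
        points.append((s, math.log(n_at_least)))
    return points


def fit_line(points):
    """Ordinary least-squares slope and intercept through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expectation_value(score, random_scores):
    """Expected number of random candidates scoring at least `score`."""
    slope, intercept = fit_line(survival_points(random_scores))
    return math.exp(intercept + slope * score)


# Synthetic scores built so that the survival count at score s is exactly 2**(10 - s).
random_scores = [10] + [s for s in range(10) for _ in range(2 ** (9 - s))]

e_top = expectation_value(20, random_scores)
```

With these synthetic scores the fit is exact, so a candidate scoring 20 gets an expectation value of 2**-10: roughly one random match that good per thousand searches.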

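Rho-diagrams: Overall Quality of a Data Set

Expectation values as a function of score for random matching: e(s) falls off exponentially with the score s.

Definition: E_i (i = 0, -1, -2, …) is the number of spectra that have been assigned an expectation value between exp(i) and exp(i-1). For random matching, with N de spectra in each expectation-value interval de:

$$E_i = \int_{\exp(i-1)}^{\exp(i)} N\,de = N\{\exp(i) - \exp(i-1)\} = N\exp(i)\{1 - \exp(-1)\}$$

$$\rho(i) = \log\left(\frac{E_i}{E_0}\right) = \log\left(\frac{N\exp(i)\{1 - \exp(-1)\}}{N\{1 - \exp(-1)\}}\right) = i$$

[Figure: Rho-diagram - Random Matching; x-axis log(e) from -6 to 0, y-axis from -6 to 0]

[Figure: Rho-diagram - Data Quality; x-axis log(e) from -10 to 0, y-axis from -10 to 0]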
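A rho-diagram can be computed directly from a list of expectation values. The sketch below assumes that the plotted quantity is rho_i = ln(E_i / E_0), where E_i counts the spectra with expectation value in the interval (exp(i-1), exp(i)]; this reading of the slide's formula, the bin range, and the function name are assumptions. For uniformly distributed expectation values, which is what random matching produces, rho_i should be close to i.

```python
import math


def rho_values(e_values, lowest_bin=-6):
    """Return {i: ln(E_i / E_0)} for bins i = lowest_bin..0.

    E_i is the number of expectation values e with exp(i-1) < e <= exp(i).
    Bins with no spectra are omitted; values above 1 or not positive are ignored.
    """
    counts = {i: 0 for i in range(lowest_bin, 1)}
    for e in e_values:
        if e <= 0 or e > 1:
            continue
        i = math.ceil(math.log(e))  # bin index such that exp(i-1) < e <= exp(i)
        if i in counts:
            counts[i] += 1
    if counts[0] == 0:
        return {}
    return {i: math.log(n / counts[0]) for i, n in counts.items() if n > 0}


# Evenly spaced expectation values on (0, 1] stand in for random matching.
n_spectra = 200000
random_like = [k / n_spectra for k in range(1, n_spectra + 1)]
rho = rho_values(random_like)
```

A data set containing real identifications has an excess of spectra at very low expectation values, so its rho-diagram rises above the diagonal toward the left.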

Rho-diagram - Parameters

How many fragments are sufficient?

To identify an unmodified peptide?

To identify a modified peptide?

To localize a modification on a peptide?

How does it depend on different parameters?

• Precursor mass
• Precursor mass error
• Fragment mass error
• Background peaks

Simulations using synthetic spectra

Select a peptide sequence: LSDPGVSPAVLSLEMLTDR (taken from a protein sequence in the sequence DB)

Calculate possible fragment ion masses

b ions: 114.09, 201.12, 316.15, 413.20, 470.22, 569.29, 656.32, 753.37, 824.41, 923.48, 1036.56, 1123.59, 1236.68, 1365.72, 1496.76, 1609.84, 1710.89, 1825.92

y ions: 175.12, 290.15, 391.19, 504.28, 635.32, 764.36, 877.44, 964.48, 1077.56, 1176.63, 1247.67, 1344.72, 1431.75, 1530.82, 1587.84, 1684.89, 1799.92, 1886.95

Choose number of fragment ions to select (the slide offers 5, 6, 7, 8, 9; the example uses 8)

Randomly select fragment ions: 201.12, 504.28, 964.48, 1123.59, 1247.67, 1496.76, 1530.82, 1710.89

Search and store result: the selected masses are searched against the sequence DB with a search engine. Is the identification significant? Is the identified sequence identical to the one used to generate the synthetic data?

Average over peptides
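The spectrum-generation steps of the simulation can be sketched as below: all singly charged b and y ions of the chosen peptide are calculated, and a fixed number of them is drawn at random to form one synthetic spectrum. The search step is not included. Function names, the use of a seeded `random.Random`, and the restriction to b and y ions without background peaks or mass error are assumptions for the example.

```python
import random

# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728
WATER = 18.01056


def b_ions(peptide):
    """Singly charged b1..b(n-1) ions in increasing order."""
    masses, total = [], PROTON
    for aa in peptide[:-1]:
        total += MONO[aa]
        masses.append(total)
    return masses


def y_ions(peptide):
    """Singly charged y1..y(n-1) ions in increasing order."""
    masses, total = [], WATER + PROTON
    for aa in reversed(peptide[1:]):
        total += MONO[aa]
        masses.append(total)
    return masses


def synthetic_spectrum(peptide, n_fragments, rng):
    """Randomly pick n_fragments distinct fragment masses from the b and y ions."""
    pool = b_ions(peptide) + y_ions(peptide)
    return sorted(rng.sample(pool, n_fragments))


peptide = "LSDPGVSPAVLSLEMLTDR"
spectrum = synthetic_spectrum(peptide, 8, random.Random(0))
```

Repeating the draw many times per peptide, searching each synthetic spectrum, and recording whether the correct sequence comes back as a significant hit gives one point on the probability-of-identification curves shown on the following slides.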

Each point is an average of searches with 20 randomly generated synthetic fragment mass spectra.

Threshold

Each point is an average of 50 peptides.

Average over peptides

Critical number of fragment masses

Small peptides are slightly more difficult to identify

[Figure: Probability of Identification vs. Number of fragment ions (0 to 20); one curve per precursor mass m_precursor = 1000 Da, 1500 Da, 2000 Da, 2500 Da]

Conditions: Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification

A lower precursor mass error requires fewer fragment masses for identification of unmodified peptides

[Figure: Probability of Identification vs. Number of fragment ions (0 to 20); one curve per precursor mass error Δm_precursor = 0.01 Da, 1 Da, 10 Da]

Conditions: m_precursor = 2000 Da, Δm_fragment = 0.5 Da, no modification

The dependence on the fragment mass error is weak below a threshold for identification of unmodified peptides

[Figure: Probability of Identification vs. Number of fragment ions (0 to 20); one curve per fragment mass error Δm_fragment = 0.01 Da, 0.5 Da, 1 Da, 2 Da]

Conditions: m_precursor = 2000 Da, Δm_precursor = 1 Da, no modification

A moderate number of background peaks can be tolerated when identifying unmodified peptides

[Figure: Probability of Identification vs. Number of fragment ions (0 to 20); one curve per background level: 0%, 50%, 80%]

Conditions: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da, no modification

A large number of background peaks can be tolerated if the fragment mass is accurate

[Figure: Probability of Identification vs. Number of fragment ions (0 to 20); one curve per background level: 0%, 50%, 80%]

Conditions: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.01 Da, no modification

Identification of phosphopeptides is only slightly more difficult

[Figure: Probability of Identification vs. Number of fragment ions (0 to 20); curves for Phosphorylated and Unmodified peptides]

Conditions: m_precursor = 2000 Da, Δm_precursor = 1 Da, Δm_fragment = 0.5 Da
