
Protein Classification

Date post: 21-Mar-2016
Upload: robyn
Transcript
Page 1: Protein Classification

Protein Classification

Page 2: Protein Classification

PDB Growth

[Figure: PDB growth plot; axis label: New PDB structures]

Page 3: Protein Classification

Protein classification

• The number of protein sequences grows exponentially
• The number of solved structures grows exponentially
• The number of new folds identified is very small (and close to constant)
• Protein classification can
    Generate an overview of structure types
    Detect similarities (evolutionary relationships) between protein sequences

Morten Nielsen, CBS, BioCentrum, DTU

SCOP release 1.67

Class                             # folds   # superfamilies   # families
All alpha proteins                    202               342          550
All beta proteins                     141               280          529
Alpha and beta proteins (a/b)         130               213          593
Alpha and beta proteins (a+b)         260               386          650
Multi-domain proteins                  40                40           55
Membrane & cell surface                42                82           91
Small proteins                         72               104          162
Total                                 887              1447         2630

Page 4: Protein Classification

Protein structure classification

[Diagram: nested levels, from outermost to innermost: protein world, protein fold, protein superfamily, protein family]

Page 5: Protein Classification

Structure Classification Databases

• SCOP Manual classification (A. Murzin) scop.berkeley.edu

• CATH Semi manual classification (C. Orengo) www.biochem.ucl.ac.uk/bsm/cath

• FSSP Automatic classification (L. Holm) www.ebi.ac.uk/dali/fssp/fssp.html


Page 6: Protein Classification

Major classes in SCOP

• Classes
    All alpha proteins
    All beta proteins
    Alpha and beta proteins (a/b)
    Alpha and beta proteins (a+b)
    Multi-domain proteins
    Membrane and cell surface proteins
    Small proteins

Page 7: Protein Classification

All α: Hemoglobin (1bab)

Page 8: Protein Classification

All β: Immunoglobulin (8fab)

Page 9: Protein Classification

α/β: Triosephosphate isomerase (1hti)

Page 10: Protein Classification

α+β: Lysozyme (1jsf)

Page 11: Protein Classification

Families

• Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity)

• Families are further subdivided into Proteins

• Proteins are divided into Species
    The same protein may be found in several species

[Diagram: classification hierarchy with levels Fold, Superfamily, Family, Proteins]

Page 12: Protein Classification

Superfamilies

• Proteins which are (remotely) evolutionarily related
    Sequence similarity is low
    Share function
    Share special structural features

• Relationships between members of a superfamily may not be readily recognizable from the sequence alone

[Diagram: classification hierarchy with levels Fold, Superfamily, Family, Proteins]

Page 13: Protein Classification

Folds

• Proteins which have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold

• No evolutionary relation between proteins

[Diagram: classification hierarchy with levels Fold, Superfamily, Family, Proteins]

Page 14: Protein Classification

Protein Classification

• Given a new protein, can we place it in its “correct” position within an existing protein hierarchy?

Methods

• BLAST / PsiBLAST

• Profile HMMs

• Supervised Machine Learning methods

[Diagram: a new protein (?) to be placed within the hierarchy of Fold, Superfamily, Family, Proteins]

Page 15: Protein Classification

PSI-BLAST

Given a sequence query x, and database D

1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position-specific matrix M
    • Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence

[Figure: profile M]
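
Step 3 can be illustrated with a minimal sketch. It assumes the significant matches are already stacked into a gap-free alignment of equal-length strings, uses the position-based weights of Henikoff & Henikoff (1994), and turns weighted counts into log-odds scores against a uniform background. The function names, the pseudocount, the uniform background, and the toy alignment are illustrative assumptions, not PSI-BLAST's actual implementation.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def henikoff_weights(alignment):
    """Position-based sequence weights (Henikoff & Henikoff 1994).

    In each column a sequence receives 1 / (k * n), where k is the number of
    distinct residues in the column and n is how many sequences share this
    sequence's residue there. Weights are summed over columns and normalized.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)
        for s, residue in enumerate(column):
            weights[s] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def build_pssm(alignment, pseudocount=0.05):
    """Return one dict per column mapping residue -> log-odds score."""
    weights = henikoff_weights(alignment)
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        # Start from a small pseudocount so unseen residues keep a finite score.
        freq = {a: pseudocount * background for a in ALPHABET}
        for w, residue in zip(weights, column):
            freq[residue] += w
        norm = sum(freq.values())
        pssm.append({a: math.log(freq[a] / norm / background) for a in ALPHABET})
    return pssm


def score(pssm, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(col[res] for col, res in zip(pssm, segment))


# Hypothetical toy alignment: three near-identical sequences, one divergent.
alignment = ["ACDE", "ACDE", "ACDF", "GCKE"]
weights = henikoff_weights(alignment)
pssm = build_pssm(alignment)
```

With this weighting the divergent sequence receives the largest weight, so the three similar sequences cannot dominate any column on their own.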

Page 16: Protein Classification

Profile HMMs

• Each M state has a position-specific pre-computed substitution table
• Each I and D state has position-specific gap penalties
• The profile is a generative model:
    The sequence X that is aligned to H is thought of as “generated by” H
    Therefore, H parameterizes a conditional distribution P(X | H)

[Figure: protein profile H, a chain from BEGIN to END with match states M1 … Mm, insert states I0 … Im, and delete states D1 … Dm]

Page 17: Protein Classification

Classification with Profile HMMs

[Figure: a new protein (?) is compared against profile HMMs (states M1 … Mm, I0 … Im, D1 … Dm between BEGIN and END) attached to the Fold, Superfamily, Family hierarchy]

Page 18: Protein Classification

Classification with Profile HMMs

• How generative models work
    Training examples (sequences known to be members of the family): positive
    The model assigns a probability to any given protein sequence
    Sequences from that family yield a higher probability than sequences from outside the family

• Log-likelihood ratio as score

$$L(X) = \log\frac{P(X \mid H_1)\,P(H_1)}{P(X \mid H_0)\,P(H_0)} = \log\frac{P(H_1 \mid X)\,P(X)}{P(H_0 \mid X)\,P(X)} = \log\frac{P(H_1 \mid X)}{P(H_0 \mid X)}$$
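
As a small numerical illustration of this score, the sketch below computes L(X) from two log-likelihoods and a prior, and recovers P(H1 | X) from it. The function names and the numbers are hypothetical; in practice the log-likelihoods would come from the family model H1 and a background model H0.

```python
import math


def log_odds_score(loglik_h1, loglik_h0, prior_h1=0.5):
    """L(X) = log [P(X|H1) P(H1)] / [P(X|H0) P(H0)], from log-likelihoods."""
    prior_h0 = 1.0 - prior_h1
    return (loglik_h1 + math.log(prior_h1)) - (loglik_h0 + math.log(prior_h0))


def posterior_h1(score):
    """P(H1 | X), using L(X) = log P(H1|X) / P(H0|X)."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical likelihoods: the family model explains X 20 times better.
example_score = log_odds_score(math.log(0.02), math.log(0.001))
example_posterior = posterior_h1(example_score)
```

With equal priors the score is simply the log of the likelihood ratio, and a positive score means the family model is the more probable explanation of X.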

Page 19: Protein Classification

Generation of a protein by a profile HMM

P(X | H) ??

To generate sequence x1…xn by profile HMM H:

We will find the total probability of all possible ways to generate X

• Define

    $A_j^M(i)$: probability of generating $x_1 \dots x_i$ and ending with $x_i$ being emitted from $M_j$

    $A_j^I(i)$: probability of generating $x_1 \dots x_i$ and ending with $x_i$ being emitted from $I_j$

    $A_j^D(i)$: probability of generating $x_1 \dots x_i$ and ending in $D_j$ ($x_i$ is the last character emitted before $D_j$)

Page 20: Protein Classification

Alignment of a protein to a profile HMM

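$$A_j^M(i) = \varepsilon_{M(j)}(x_i)\,\bigl[\, A_{j-1}^M(i-1)\,\alpha_{M(j-1)M(j)} + A_{j-1}^I(i-1)\,\alpha_{I(j-1)M(j)} + A_{j-1}^D(i-1)\,\alpha_{D(j-1)M(j)} \,\bigr]$$

$$A_j^I(i) = \varepsilon_{I(j)}(x_i)\,\bigl[\, A_j^M(i-1)\,\alpha_{M(j)I(j)} + A_j^I(i-1)\,\alpha_{I(j)I(j)} + A_j^D(i-1)\,\alpha_{D(j)I(j)} \,\bigr]$$

$$A_j^D(i) = A_{j-1}^M(i)\,\alpha_{M(j-1)D(j)} + A_{j-1}^I(i)\,\alpha_{I(j-1)D(j)} + A_{j-1}^D(i)\,\alpha_{D(j-1)D(j)}$$

Because the A values are probabilities, each term is a product with a transition probability α. Working with log-probabilities and taking the max instead of the sum turns the same recurrences into the search for the single best (Viterbi) alignment.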
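
A minimal sketch of the probability-space forward pass over the three tables for match, insert, and delete states follows. To stay short it makes simplifying assumptions that are not on the slides: transition probabilities are shared across positions (a real profile has position-specific ones), all insert states share one emission table, BEGIN is treated as a silent M0, and the transition into END reuses the "to M" probability. All names and numbers are illustrative.

```python
def forward_probability(x, emit_M, emit_I, a):
    """Sum over all state paths of the probability of generating x.

    emit_M: list of dicts, emit_M[j-1][c] = emission of c from M_j (j = 1..m)
    emit_I: dict, emission table shared by all insert states (assumption)
    a:      a[s][t] = transition from state type s to type t, where
            t = "M" moves to M_{j+1}, "I" stays at I_j, "D" moves to D_{j+1}
    """
    n, m = len(x), len(emit_M)
    AM = [[0.0] * (n + 1) for _ in range(m + 1)]
    AI = [[0.0] * (n + 1) for _ in range(m + 1)]
    AD = [[0.0] * (n + 1) for _ in range(m + 1)]
    AM[0][0] = 1.0  # BEGIN acts as M_0 with nothing emitted yet

    for i in range(n + 1):
        for j in range(m + 1):
            if i >= 1 and j >= 1:  # x_i emitted from M_j
                AM[j][i] = emit_M[j - 1].get(x[i - 1], 0.0) * (
                    AM[j - 1][i - 1] * a["M"]["M"]
                    + AI[j - 1][i - 1] * a["I"]["M"]
                    + AD[j - 1][i - 1] * a["D"]["M"]
                )
            if i >= 1:  # x_i emitted from I_j
                AI[j][i] = emit_I.get(x[i - 1], 0.0) * (
                    AM[j][i - 1] * a["M"]["I"]
                    + AI[j][i - 1] * a["I"]["I"]
                    + AD[j][i - 1] * a["D"]["I"]
                )
            if j >= 1:  # silent delete state D_j, no character consumed
                AD[j][i] = (
                    AM[j - 1][i] * a["M"]["D"]
                    + AI[j - 1][i] * a["I"]["D"]
                    + AD[j - 1][i] * a["D"]["D"]
                )

    # Transition from the last column of states into END.
    return AM[m][n] * a["M"]["M"] + AI[m][n] * a["I"]["M"] + AD[m][n] * a["D"]["M"]


# Hypothetical one-column profile that emits "A" from its match state.
toy_transitions = {
    "M": {"M": 0.8, "I": 0.1, "D": 0.1},
    "I": {"M": 0.9, "I": 0.1, "D": 0.0},
    "D": {"M": 1.0, "I": 0.0, "D": 0.0},
}
p_match = forward_probability("A", [{"A": 1.0}], {"A": 0.5, "B": 0.5}, toy_transitions)

# Deterministic two-column profile: only "AB" can be generated.
only_match = {s: {"M": 1.0, "I": 0.0, "D": 0.0} for s in "MID"}
p_ab = forward_probability("AB", [{"A": 1.0}, {"B": 1.0}], {"A": 0.5, "B": 0.5}, only_match)
p_aa = forward_probability("AA", [{"A": 1.0}, {"B": 1.0}], {"A": 0.5, "B": 0.5}, only_match)
```

The loop order matters: the delete table at position j reads values at j-1 for the same i, so j must run in increasing order inside each i.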

Page 21: Protein Classification

Generative Models

Page 22: Protein Classification

Generative Models

Page 23: Protein Classification

Generative Models

Page 24: Protein Classification

Generative Models

Page 25: Protein Classification

Generative Models

Page 26: Protein Classification

Discriminative Methods

Instead of modeling the process that generates data, directly discriminate between classes

• More direct way to the goal
• Better if the model is not accurate

Page 27: Protein Classification

Discriminative Models -- SVM

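[Figure: two classes of points separated by a hyperplane with normal vector v and a margin]

Decision rule: red: $v^T x > 0$

If $x_1 \dots x_n$ are training examples,

$\mathrm{sign}\bigl(\sum_i \lambda_i x_i^T x\bigr)$ “decides” where x falls

• Train $\lambda_i$ to achieve the best margin

Large margin for |v| < 1
Margin of 1 for small |v|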
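
The decision rule can be sketched in a few lines. The sketch assumes the weights are signed, so that the sign of each λ_i encodes the class of its training example; the points and weights below are made-up values, not trained ones.

```python
def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decide(x, training_points, lambdas):
    """Return +1 or -1 from sign(sum_i lambda_i * x_i^T x)."""
    total = sum(lam * dot(xi, x) for lam, xi in zip(lambdas, training_points))
    return 1 if total > 0 else -1


# Hypothetical training examples; signed weights encode the two classes.
training_points = [(2.0, 1.0), (1.5, 2.0), (-1.0, -2.0), (-2.0, -0.5)]
lambdas = [0.5, 0.5, -0.5, -0.5]

# The same rule written with an explicit normal vector v = sum_i lambda_i x_i.
v = tuple(
    sum(lam * xi[d] for lam, xi in zip(lambdas, training_points)) for d in range(2)
)
```

Because the rule touches the data only through inner products, the inner product can later be swapped for a kernel K(x_i, x) without changing the structure of the classifier.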

Page 28: Protein Classification

Discriminative protein classification

Jaakkola, Diekhans, Haussler, ISMB 1999

• Define the discriminating function to be

$$L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$$

We decide $X \in$ family $H_1$ whenever $L(X) > 0$

• For now, let’s just assume K(.,.) is a similarity function

• Then, we want to train $\lambda_i$ so that this classifier makes as few mistakes as possible on new data

• Similarly to SVMs, train $\lambda_i$ so that the margin is largest for $0 \le \lambda_i \le 1$

Page 29: Protein Classification

Discriminative protein classification

• Ideally, for training examples, $L(X_i) \ge 1$ if $X_i \in H_1$, and $L(X_i) \le -1$ otherwise

• This is not always possible; softer constraints are obtained with the following objective function

$$J(\lambda) = \sum_{X_i \in H_1} \lambda_i \bigl(2 - L(X_i)\bigr) - \sum_{X_j \in H_0} \lambda_j \bigl(2 + L(X_j)\bigr)$$

• Training: for $X_i \in H_1$, try to “make” $L(X_i) = 1$

$$\lambda_i \leftarrow \frac{1 - L(X_i) + \lambda_i K(X_i, X_i)}{K(X_i, X_i)}$$

with minimum allowable value 0 and maximum 1

• Similarly, for $X_i \in H_0$, try to “make” $L(X_i) = -1$
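
The update can be sketched as repeated sweeps over the training set, clipping each weight to [0, 1]. The sketch below is a toy under stated assumptions: it uses a Gaussian kernel on 2-D points as a stand-in for the sequence kernel, labels of +1 for H1 and -1 for H0, and a fixed number of sweeps; none of these choices come from the slides.

```python
import math


def rbf(x, y, sigma=1.0):
    """Stand-in similarity function (Gaussian kernel on plain vectors)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def discriminant(x, points, labels, lambdas, kernel):
    """L(x): H1 terms enter with +lambda, H0 terms with -lambda."""
    return sum(s * lam * kernel(x, p) for p, s, lam in zip(points, labels, lambdas))


def train(points, labels, kernel, sweeps=100):
    """Coordinate-wise updates pushing s_i * L(X_i) toward 1."""
    lambdas = [0.0] * len(points)
    for _ in range(sweeps):
        for i, (xi, si) in enumerate(zip(points, labels)):
            kii = kernel(xi, xi)
            li = discriminant(xi, points, labels, lambdas, kernel)
            # For H1 (s = +1): (1 - L + lambda*K) / K; for H0 (s = -1): (1 + L + lambda*K) / K.
            proposed = (1.0 - si * li + lambdas[i] * kii) / kii
            lambdas[i] = min(1.0, max(0.0, proposed))  # clip to [0, 1]
    return lambdas


# Hypothetical data: two H1 examples near (1, 1), two H0 examples near (-1, -1).
points = [(1.0, 1.0), (1.2, 0.8), (-1.0, -1.0), (-0.8, -1.2)]
labels = [1, 1, -1, -1]
lambdas = train(points, labels, rbf)
score_pos = discriminant((0.9, 1.1), points, labels, lambdas, rbf)
score_neg = discriminant((-1.1, -0.9), points, labels, lambdas, rbf)
```

Each update solves for the single weight that would make the target hold exactly for that example while all other weights stay fixed, which is why the current contribution λ_i K(X_i, X_i) is added back in the numerator.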

Page 30: Protein Classification

The Fisher Kernel

• The function K(X, Y) compares two sequences
    Acts effectively as an inner product in a (non-Euclidean) space
    Called “Kernel”
      • Has to be positive definite
      • For any $X_1, \dots, X_n$, the matrix K with $K_{ij} = K(X_i, X_j)$ is such that for any $x \in \mathbb{R}^n$, $x \ne 0$: $x^T K x > 0$

• Choice of this function is important

• Consider $P(X \mid H_1, \theta)$: sufficient statistics
    The expected number of times X takes each transition/emission

[Figure: profile HMM with states M1 … Mm, I0 … Im, D1 … Dm between BEGIN and END]

Page 31: Protein Classification

The Fisher Kernel

• Fisher score
    $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
    Quantifies how each parameter contributes to generating X
    For two different sequences X and Y, can compare $U_X$, $U_Y$

• $D_F^2(X, Y) = \frac{1}{2\sigma^2}\,|U_X - U_Y|^2$

• Given this distance function, K(X, Y) is defined as a similarity measure:
    $K(X, Y) = \exp\bigl(-D_F^2(X, Y)\bigr)$
    Set $\sigma$ so that the average distance of training sequences $X_i \in H_1$ to sequences $X_j \in H_0$ is 1

[Figure: profile HMM with states M1 … Mm, I0 … Im, D1 … Dm between BEGIN and END]

Question: Is the partial derivative larger when X “uses” a given parameter $\theta_i$ more or less often?

Question: Is the partial derivative larger when a given parameter $\theta_i$ is larger or smaller?
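
Once Fisher score vectors are available, the kernel itself is a short computation. The sketch below takes the score vectors as given (the numbers are hypothetical and do not come from a real HMM) and reads the scaling rule as "choose σ so that the average squared Fisher distance between H1 and H0 training examples equals 1"; that reading is an assumption.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance |u - v|^2 between two score vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def choose_sigma(pos_scores, neg_scores):
    """Pick sigma so the mean of |U_X - U_Y|^2 / (2 sigma^2) over H1 x H0 pairs is 1."""
    pairs = [(u, v) for u in pos_scores for v in neg_scores]
    mean_d2 = sum(squared_distance(u, v) for u, v in pairs) / len(pairs)
    return math.sqrt(mean_d2 / 2.0)


def fisher_kernel(u_x, u_y, sigma):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2))."""
    return math.exp(-squared_distance(u_x, u_y) / (2.0 * sigma ** 2))


# Hypothetical Fisher score vectors for training sequences.
pos_scores = [(1.0, 0.0), (0.8, 0.2)]
neg_scores = [(-1.0, 0.0), (-0.9, -0.1)]
sigma = choose_sigma(pos_scores, neg_scores)
k_same_class = fisher_kernel(pos_scores[0], pos_scores[1], sigma)
k_cross_class = fisher_kernel(pos_scores[0], neg_scores[0], sigma)
```

Sequences whose score vectors point the same way, meaning they stress the same HMM parameters, get a kernel value near 1, while sequences from the other class get a value closer to 0.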

Page 32: Protein Classification

The Fisher Kernel

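• In summary, to distinguish between family H1 and (non-family) H0, define
    Profile H1
    $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$ (Fisher score)
    $D_F^2(X, Y) = \frac{1}{2\sigma^2}\,|U_X - U_Y|^2$ (distance)
    $K(X, Y) = \exp\bigl(-D_F^2(X, Y)\bigr)$ (akin to dot product)
    $L(X) = \sum_{X_i \in H_1} \lambda_i K(X, X_i) - \sum_{X_j \in H_0} \lambda_j K(X, X_j)$

• Iteratively adjust $\lambda$ to optimize
    $J(\lambda) = \sum_{X_i \in H_1} \lambda_i \bigl(2 - L(X_i)\bigr) - \sum_{X_j \in H_0} \lambda_j \bigl(2 + L(X_j)\bigr)$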

Page 33: Protein Classification

The Fisher Kernel

• If a given superfamily has more than one profile model,

$$L_{\max}(X) = \max_i L_i(X) = \max_i \Bigl( \sum_{X_j \in H_i} \lambda_j K(X, X_j) - \sum_{X_j \in H_0} \lambda_j K(X, X_j) \Bigr)$$

[Figure: several profile HMMs, one per family, within one superfamily]

Page 34: Protein Classification

Benchmarks

• Methods evaluated
    BLAST (Altschul et al. 1990; Gish & States 1993)
    HMMs using SAM-T98 methodology (Park et al. 1998; Karplus, Barrett, & Hughey 1998; Hughey & Krogh 1995, 1996)
    SVM-Fisher

• Measurement of recognition rate for members of superfamilies of SCOP (Hubbard et al. 1997)
    PDB90 eliminates redundant sequences
    Withhold all members of a given SCOP family
    Train with the remaining members of the SCOP superfamily
    Test with withheld data
    Question: “Could the method discover a new family of a known superfamily?”

O. Jangmin

Page 35: Protein Classification


Page 36: Protein Classification

Other methods

• WU-BLAST version 2.0a16 (Altschul & Gish 1996)
    The PDB90 database was queried with each positive training example, and E-values were recorded
    BLAST:SCOP-only
    BLAST:SCOP+SAM-T98-homologs
    Scores were combined by the maximum method

• SAM-T98 method
    Same data and same set of models as in SVM-Fisher
    Combined with the maximum method

Page 37: Protein Classification

Results

• Metric: the rate of false positives (RFP)

• RFP for a positive test sequence: the fraction of negative test sequences that score as good as or better than the positive sequence

• Results for the G proteins family of the nucleotide triphosphate hydrolases SCOP superfamily
    Test the ability to distinguish 8 PDB90 G proteins from 2439 sequences in other SCOP folds
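
The RFP metric is simple to compute once every test sequence has a score. The sketch below assumes that a higher score means a better match (E-values would have to be negated first); the scores are made-up numbers, not benchmark results.

```python
def rate_of_false_positives(positive_score, negative_scores):
    """Fraction of negative test sequences scoring as good as or better than the positive one."""
    hits = sum(1 for s in negative_scores if s >= positive_score)
    return hits / len(negative_scores)


# Hypothetical classifier scores for negative and positive test sequences.
negative_scores = [0.1, 0.4, 0.35, 0.8, 0.05]
positive_scores = [0.9, 0.5, 0.3]
rfps = [rate_of_false_positives(p, negative_scores) for p in positive_scores]
```

An RFP of 0 means the positive sequence outranks every negative one; values near 1 mean the method effectively failed to recognize that sequence.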

Page 38: Protein Classification

Table 1. Rate of false positives for the G proteins family. BLAST = BLAST:SCOP-only, B-Hom = BLAST:SCOP+SAM-T98-homologs, S-T98 = SAM-T98, and SVM-F = SVM-Fisher method

Page 39: Protein Classification
Page 40: Protein Classification

QUESTION

Running time of Fisher kernel SVM on query X?

