
Statistical Inference Using Graphs for Protein Complex Identification

Denise Scholtens, Robert Gentleman, Marc Vidal

Workshop on Statistical Inference, Computing, and Visualization for Graphs

Stanford University, August 1-2, 2003

Graphic from: U.S. Department of Energy Human Genome Program, http://www.ornl.gov/hgmis

High-throughput Protein Complex Identification

• Gavin, et al. (Nature, 2002)
  – TAP: Tandem Affinity Purification

• Ho, et al. (Nature, 2002)
  – HMS-PCI: High-throughput Mass Spectrometric Protein Complex Identification

Protein Complex Identification Using TAP

Panels: Data, Spoke Model, Matrix Model

Bader, et al. (Nature Biotechnology, 2002)
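The slide names the spoke and matrix models without defining them. As the terms are commonly used for bait-hit purification data, the spoke model pairs the bait with each of its hits, while the matrix model pairs every protein in the purification with every other. A minimal sketch of that difference, using made-up protein names (`Bait`, `Hit1`, ...) and hypothetical helper functions that are not from the talk:

```python
from itertools import combinations

def spoke_edges(bait, hits):
    """Spoke model: the bait is paired with each hit; hits are not paired with each other."""
    return {frozenset((bait, hit)) for hit in hits}

def matrix_edges(bait, hits):
    """Matrix model: every pair of proteins in the purification is treated as interacting."""
    members = [bait, *hits]
    return {frozenset(pair) for pair in combinations(members, 2)}

# Illustrative purification only (not real data).
bait, hits = "Bait", ["Hit1", "Hit2", "Hit3"]
spoke = spoke_edges(bait, hits)
matrix = matrix_edges(bait, hits)

print(len(spoke), len(matrix))
```

Every spoke edge is also a matrix edge, so the two models differ only in whether hit-hit pairs are assumed.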

Protein-Complex Affiliation Network Incidence Matrix

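A is an n x m matrix of 0s and 1s with one row per protein (P1, P2, ..., Pn) and one column per complex (C1, C2, ..., Cm). An entry is 1 when the protein belongs to the complex and 0 otherwise.

Example rows:

1 1 1 0 0 0 0 … 0
0 0 0 0 1 1 1 … 0
1 0 0 1 0 0 0 … 0
1 1 0 1 0 0 0 … 0
1 1 1 1 0 0 0 … 0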

Cohesive vs. Dynamic Protein Complexes

Cohesive Complex: a complex of invariable composition whose proteins are associated only with that complex and its particular function

Cohesive Complex Affiliation Network Incidence Matrix

A =
         C1
Bait      1
Hit 1     1
Hit 2     1
Hit 3     1
Hit 4     1
Hit 5     1

Cohesive vs. Dynamic Protein Complexes

Dynamic Complex: a complex composed of proteins that may also be involved in other complexes

Dynamic Complex Affiliation Network Incidence Matrices

A =
         C1  C2  C3  C4  C5
Bait      1   1   1   1   1
Hit 1     1   0   0   0   0
Hit 2     0   1   0   0   0
Hit 3     0   0   1   0   0
Hit 4     0   0   0   1   0
Hit 5     0   0   0   0   1

A =
         C1  C2
Bait      1   1
Hit 1     1   0
Hit 2     0   1
Hit 3     1   0
Hit 4     0   1
Hit 5     1   0

A =
         C1  C2
Bait      1   1
Hit 1     1   1
Hit 2     1   1
Hit 3     0   1
Hit 4     0   1
Hit 5     0   1

All 5 “complexes” above would yield the same TAP data: one bait purification that recovers Hit 1 through Hit 5.

Statistical Inference Problem

• What is A?

• A captures the cohesive/dynamic distinction.

• At best, we observe all but the main diagonal of X=AA’.

• Current analyses focus on X, not on A.
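To make the identifiability problem concrete, the sketch below takes the cohesive matrix and the three dynamic matrices from the earlier slides, forms X = AA’ for each, and reads off the bait row. It assumes that a TAP experiment records only whether each hit shares at least one complex with the bait, which is an interpretation of the slides rather than something they state. The helper names `gram` and `bait_observation` are hypothetical.

```python
def gram(A):
    """Return X = AA' for an incidence matrix A given as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in A] for ri in A]

def bait_observation(A):
    """Row 0 is the bait: report which hits share at least one complex with it."""
    X = gram(A)
    return [X[0][j] > 0 for j in range(1, len(A))]

# Rows are Bait, Hit 1, ..., Hit 5; columns are complexes.
cohesive = [[1], [1], [1], [1], [1], [1]]
dynamic_five = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
dynamic_split = [[1, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]
dynamic_nested = [[1, 1], [1, 1], [1, 1], [0, 1], [0, 1], [0, 1]]

candidates = [cohesive, dynamic_five, dynamic_split, dynamic_nested]
observations = [bait_observation(A) for A in candidates]

for obs in observations:
    print(obs)
```

Under that assumption every candidate A gives the same bait row, so the purification alone cannot distinguish the cohesive case from the dynamic ones.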

Protein Complex Data as a Directed Graph


Cohesive Complex described in Gavin, et al.

Dynamic Complex described in Gavin, et al.

Largest Connected Component in Gavin, et al. using Bait Proteins Only, Colored by Outdegree

Panels: Gavin Data, Ho Data

SubGraph of Bait Proteins from Previous Graphs with Outdegree 7

Panels: Gavin Data, Ho Data
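The graphs on these slides treat each purification as directed edges from a bait to the proteins it recovers. The sketch below shows one way such a bait-only digraph, its out-degrees, and an out-degree subgraph could be computed. The purification lists are invented for illustration (they are not the Gavin or Ho data), the helper names are hypothetical, and measuring out-degree within the bait-only graph is an assumption; the toy example selects out-degree 2 rather than 7.

```python
# Hypothetical purifications (bait -> proteins recovered); not real Gavin or Ho data.
purifications = {
    "B1": ["B2", "B3", "H1"],
    "B2": ["B1"],
    "B3": ["B1", "B2", "H2", "H3"],
}

def bait_only_digraph(purifications):
    """Keep only bait -> bait edges, dropping hits that were never used as baits."""
    baits = set(purifications)
    return {
        bait: sorted((set(hits) & baits) - {bait})
        for bait, hits in purifications.items()
    }

def subgraph_with_outdegree(graph, k):
    """Induced subgraph on the nodes whose out-degree in `graph` equals k."""
    keep = {node for node, targets in graph.items() if len(targets) == k}
    return {node: [t for t in graph[node] if t in keep] for node in sorted(keep)}

graph = bait_only_digraph(purifications)
out_degree = {bait: len(targets) for bait, targets in graph.items()}
sub = subgraph_with_outdegree(graph, 2)

print(out_degree)
print(sub)
```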

Examples of Distinct Complexes Identified by Gavin, et al.

Back to Affiliation Networks

A =
      C1
B1     1
B2     1
B3     1

X=AA’ =
      B1  B2  B3
B1     1   1   1
B2     1   1   1
B3     1   1   1

One Three-Way Conversation

Affiliation Networks

A =
      C1  C2  C3
B1     1   1   0
B2     1   0   1
B3     0   1   1

X=AA’ =
      B1  B2  B3
B1     2   1   1
B2     1   2   1
B3     1   1   2

Three Two-Way Conversations
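The two affiliation matrices above can be compared directly. The sketch below recomputes X = AA’ for the one three-way conversation and for the three two-way conversations, then separates the main diagonal from the off-diagonal entries. The helper names `gram` and `off_diagonal` are hypothetical and used only for illustration.

```python
def gram(A):
    """Return X = AA' for an incidence matrix A given as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in A] for ri in A]

def off_diagonal(X):
    """Drop the main diagonal, keeping the remaining entries row by row."""
    return [[x for j, x in enumerate(row) if j != i] for i, row in enumerate(X)]

# Rows are B1, B2, B3; columns are complexes.
one_three_way = [[1], [1], [1]]
three_two_way = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]

X1 = gram(one_three_way)
X2 = gram(three_two_way)

diag1 = [X1[i][i] for i in range(3)]
diag2 = [X2[i][i] for i in range(3)]

print(off_diagonal(X1) == off_diagonal(X2))
print(diag1, diag2)
```

The two X matrices agree everywhere except on the main diagonal, which counts how many complexes each protein belongs to.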

Statistical Inference Problem

• Which A is correct?
  – A uniquely defines X, but the observable part of X does not uniquely define A.

• Extra information and directed graph model for the TAP data
  – Cellular Component Data
  – Gene Expression Data
  – Hit Data

Possible Use of Hit Data to Help Estimate A

Conclusions

• In the protein complex setting, directed graphs are useful for EDA, as well as framing the correct questions for statistical inference.

• Statistical inference problem for cohesive and dynamic protein complex identification should focus on A, not X.

• Digraph model of the TAP data better reflects what we actually observe, and is informative for estimating A.