On finding clusters in undirected simple graphs: application to protein complex detection

Today’s lecture will cover the following three topics:

1. On finding clusters in undirected simple graphs: application to protein complex detection

2. DPClus software tool

3. Concept of Line Graphs

Comparative Genomics

(Network Biology)

Outline

•Introduction

•Some basic concepts

•The proposed algorithm

•The DPClus software

•Results & Discussion

•Conclusions

On finding clusters in undirected simple graphs: application to protein complex detection

Introduction

•There is no universal definition of a cluster.

•But clustering is an important issue.

•Consequently there are diverse definitions and various methods.

•The major purpose of clustering is finding cohesive groups.

•Here, we are going to discuss a graph clustering algorithm.

In a graph, a cluster is a subgraph whose nodes are more densely connected with each other than with the other nodes of the graph.

This is a flexible definition of a cluster.

Intuitively, we can recognize two clusters in this arbitrary graph.

Introduction

But it is difficult to draw a big graph revealing its clusters.

An E. coli protein-protein interaction network---consisting of 3007 proteins and 11531 interactions (From Mori Lab NAIST, Japan)

An algorithm is needed to detect locally dense regions.

Introduction

Md. Altaf-Ul-Amin, Yoko Shinbo, Kenji Mihara, Ken Kurokawa and Shigehiko Kanaya, “Development and implementation of an algorithm for detection of protein complexes in large interaction networks”, BMC Bioinformatics 7:207, April 2006.


Some basic concepts

Two nodes that belong to the same cluster are likely to have more common neighbors than two nodes that do not.

Some basic concepts

•The density d of a cluster is the ratio of the number of edges present in it and the maximum possible number of edges in it.

•It is easy to see that $d = |E| / |E|_{max} = 2|E| / (|N|(|N|-1))$.

•d is a real number ranging from 0 to 1.
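The density definition above can be checked with a few lines of Python. This is a minimal sketch; the function name `density` and its arguments are chosen here for illustration and are not part of DPClus.

```python
def density(num_nodes, num_edges):
    """Density d = 2|E| / (|N|(|N|-1)) of a simple undirected graph."""
    if num_nodes < 2:
        raise ValueError("density needs at least two nodes")
    max_edges = num_nodes * (num_nodes - 1) / 2  # |E|max for a simple graph
    return num_edges / max_edges


# A graph with 8 nodes and 14 edges has density 0.5.
print(density(8, 14))
```

A complete graph gives d = 1 and a graph without edges gives d = 0, which matches the stated range.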

Some basic concepts

Density of the total graph = 0.241

[Figure: two complexes in the example graph, with d=0.9 and d=1.0]

The densities of the complexes are higher than the density of the total graph.

Some basic concepts

Considering density alone is not enough

Such situations can be tackled by keeping track of the periphery

Some basic concepts

•Both the graphs consist of 8 nodes and both are of density 0.5

•But one of them seems to be a single cluster while the other is divided into two clusters

[Figure: two graphs, each with 8 nodes labeled a to h]

Some basic concepts

The cluster property of any node n with respect to any cluster k of density $d_k$ and size $|N_k|$ is defined as follows:

$cp_{nk} = |E_{nk}| / (d_k \cdot |N_k|)$

Here, $|E_{nk}|$ is the total number of edges between the node n and the nodes of cluster k.
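The cluster property can be evaluated directly from its three ingredients. The sketch below is only an illustration of the formula; the function and argument names are assumptions made here.

```python
def cluster_property(edges_to_cluster, cluster_density, cluster_size):
    """cp_nk = |E_nk| / (d_k * |N_k|) for a node n and a cluster k."""
    return edges_to_cluster / (cluster_density * cluster_size)


# A node with 2 edges into a cluster of 4 nodes and density 1 has cp = 0.5.
print(cluster_property(2, 1.0, 4))
```

Since d_k * |N_k| grows with the size and density of the cluster, a node needs more edges into a large dense cluster to reach the same cp value.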

[Figure: the two 8-node graphs (nodes a to h)]

Cluster property of node f = 0.57

Cluster property of node f = 0.2

The proposed algorithm is a sequential constructive algorithm:

It initializes the complex/cluster by choosing a seed node.

It then repeatedly adds other nodes on the basis of priority and some conditions.

The major methods of the algorithm

•Choosing a seed node.

•Selecting a priority node.

•Checking necessary conditions before adding a node to a complex.

The proposed Algorithm

Inputs to the algorithm are:

•The adjacency matrix of the network.

•A minimum threshold density for the generated clusters.

•A parameter to determine how we separate a complex from its periphery.

Outputs of the algorithm are:

Overlapping/non-overlapping complexes whose densities are greater than or equal to the given threshold density.

The proposed Algorithm

Input & Initialization

•Input an undirected simple graph G. Set thresholds din and cpin and initialize cluster ID k = 1.

Termination check

•Generate the weight of each node of G. If the highest node weight is 0, generate the degrees of the nodes of G and determine the highest node degree (Dh). If Dh = 0, end.

Seed selection

•If the highest node weight is not 0, start at the highest weight node of G as the kth cluster. Otherwise, start at the highest degree node of G as the kth cluster.

Cluster formation

•Generate the neighbors of the kth cluster in G and sort them according to priority. Add the highest priority neighbor (p) to the cluster.

•Check dk > din and cpp(k-p) > cpin. If both hold, keep p and generate the neighbors of the kth cluster again.

•Otherwise, deduct the last added node from the kth cluster. If not all neighbors of the kth cluster are checked, add the next priority neighbor (p) to the kth cluster and repeat the checks.

Output & update

•When all neighbors of the kth cluster are checked, print the kth cluster. Set G ← G - kth cluster and k ← k+1, then return to the termination check.

Flowchart of the proposed Algorithm
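The flowchart can be turned into a short program. The following is a simplified sketch of the non-overlapping mode under several assumptions made here, and it is not the DPClus implementation: the graph is a dictionary mapping each node to the set of its neighbors, weights are recomputed on the remaining graph in every round, ties are broken by node degree and then by name, a single node is given density 1, and the thresholds are tested with "greater than or equal". The names `cluster_sketch`, `d_in` and `cp_in` are hypothetical.

```python
from itertools import combinations


def density(nodes, adj):
    """Density of the subgraph induced by `nodes`: 2|E| / (|N|(|N|-1))."""
    n = len(nodes)
    if n < 2:
        return 1.0  # assumption: a single node counts as density 1
    edges = sum(1 for u, v in combinations(nodes, 2) if v in adj[u])
    return 2.0 * edges / (n * (n - 1))


def cluster_sketch(graph, d_in, cp_in):
    adj = {u: set(vs) for u, vs in graph.items()}  # working copy of G
    clusters = []
    while any(adj.values()):  # termination check: some edge is left
        # Edge weight = number of common neighbors; node weight = their sum.
        weight = {u: sum(len(adj[u] & adj[v]) for v in adj[u]) for u in adj}
        seed = max(adj, key=lambda u: (weight[u], len(adj[u])))
        cluster = [seed]
        grown = True
        while grown:
            grown = False
            members = set(cluster)
            neighbors = {v for u in cluster for v in adj[u]} - members

            def priority(v):
                # (sum of edge weights to the cluster, number of edges)
                links = adj[v] & members
                return (sum(len(adj[v] & adj[u]) for u in links), len(links))

            for p in sorted(sorted(neighbors), key=priority, reverse=True):
                cp = len(adj[p] & members) / (density(cluster, adj) * len(cluster))
                if density(cluster + [p], adj) >= d_in and cp >= cp_in:
                    cluster.append(p)  # keep p and regenerate the neighbors
                    grown = True
                    break
        clusters.append(cluster)
        for u in cluster:  # G <- G - kth cluster
            for v in adj[u]:
                adj[v].discard(u)
            adj[u] = set()
    return clusters


# Two triangles joined by the single edge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
clusters = cluster_sketch(graph, d_in=0.7, cp_in=0.5)
print(clusters)
```

On this toy graph the bridge node d is rejected from the first cluster because its cp value is 1/3, so the two triangles come out as separate clusters.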

M =

0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 1 0 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 1 1
0 0 0 0 0 0 0 0 1 0 1 0 1 1
0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 0 0 1 0

Muv = 1 if there is an edge between nodes u and v, and 0 otherwise.


M² =

1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 4 2 2 3 2 1 1 0 0 0 0 0 0
1 2 4 3 2 3 1 1 0 0 0 0 0 0
1 2 3 5 2 3 1 0 1 0 0 0 0 0
0 3 2 2 3 2 1 1 0 0 0 0 0 0
1 2 3 3 2 5 0 1 0 0 1 0 0 0
0 1 1 1 1 0 2 0 0 1 0 0 0 0
0 1 1 0 1 1 0 2 0 1 0 0 1 1
0 0 0 1 0 0 0 0 4 2 1 1 2 2
0 0 0 0 0 0 1 1 2 4 0 1 2 2
0 0 0 0 0 1 0 0 1 0 2 0 1 1
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 1 2 2 1 0 4 2
0 0 0 0 0 0 0 1 2 2 1 1 2 3

(M²)uv for u ≠ v represents the number of common neighbors of the nodes u and v. The diagonal entry (M²)uu is the degree of node u.




[Figure: the example graph with each edge labeled by its weight]

The weights of edges are derived by squaring the adjacency matrix of the graph.

[Figure: the example graph with each edge and each node labeled by its weight]

The weights of nodes (sum of the weights of the connecting edges)
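The edge and node weights can be reproduced from the 14 x 14 matrix M given earlier. The sketch below uses plain Python lists; the 0-based node indices and the variable names are assumptions made for illustration.

```python
ROWS = """
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 1 0 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 1 1
0 0 0 0 0 0 0 0 1 0 1 0 1 1
0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 0 0 1 0
"""

M = [[int(x) for x in line.split()] for line in ROWS.strip().splitlines()]
n = len(M)

# (M^2)[u][v] counts the common neighbors of u and v when u != v.
M2 = [[sum(M[u][k] * M[k][v] for k in range(n)) for v in range(n)]
      for u in range(n)]

# Weight of an existing edge (u, v) = number of common neighbors of u and v.
edge_weight = {(u, v): M2[u][v]
               for u in range(n) for v in range(u + 1, n) if M[u][v]}

# Weight of a node = sum of the weights of its connecting edges.
node_weight = [sum(M2[u][v] for v in range(n) if M[u][v]) for u in range(n)]

# The seed is a node of highest weight (the first one found on ties).
seed = max(range(n), key=lambda u: node_weight[u])

print(node_weight)
print(seed)
```

The computed node weights contain three nodes of weight 10 and six nodes of weight 6, the same values that label the nodes in the figures.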

[Figures: step-by-step growth of the first cluster on the example graph. In each figure the edges and nodes are labeled with their weights, and the seed and the neighbors of the current cluster are marked. The tables list the neighbors of the current cluster.]

Seed selected. Neighbors of the seed:

Neighbor   Sum of edge weights   # of edges
P1         2                     1
P3         3                     1
P4         2                     1
P5         3                     1

Neighbors sorted by priority:

Neighbor   Sum of edge weights   # of edges
P3         3                     1
P5         3                     1
P1         2                     1
P4         2                     1

cp of P3 = 1

d=1.0. Neighbors:

Neighbor   Sum of edge weights   # of edges
P1         4                     2
P4         4                     2
P5         6                     2
P7         0                     1

d=1.0. Neighbors sorted by priority:

Neighbor   Sum of edge weights   # of edges
P5         6                     2
P1         4                     2
P4         4                     2
P7         0                     1

cp of P5 = 1

d=1.0. Neighbors:

Neighbor   Sum of edge weights   # of edges
P1         4                     2
P4         4                     2
P6         0                     1
P7         0                     1

cp of P1 = 1

d=1.0. Neighbors:

Neighbor   Sum of edge weights   # of edges
P0         0                     1
P4         4                     2
P6         0                     1
P7         0                     1

d=1.0. Neighbors sorted by priority:

Neighbor   Sum of edge weights   # of edges
P4         4                     2
P0         0                     1
P6         0                     1
P7         0                     1

cp of P4 = 0.75

d=0.9. Neighbors:

Neighbor   Sum of edge weights   # of edges   cp-value
P0         0                     1            ~0.22
P6         0                     1            ~0.22
P7         0                     1            ~0.22

The remaining graph

[Figure: the graph that remains after the first cluster is removed, with edge weights, node weights and the new seed marked]

[Figures: three further steps of cluster growth on the remaining graph, each with d=1.0]

The remaining graph


Clustering by the proposed algorithm

Example

[Figure (i): the example graph with 12 nodes, A to L]

1. Input and initialization: cpin = 0.4, din = 0.6

1. Seed selection-1: calculation of the weights of edges

[Figure (ii): the example graph with each edge labeled by its weight]

1. Seed selection-2: calculation of the weights of nodes

[Figure (iii), seed selection for cluster 1: the example graph with each edge and node labeled by its weight; the selected seed is marked]

2. Cluster formation-1: calculation of the weights of nodes

[Figure (iv), formation of cluster 1: the seed forms Cluster 1 with d1 = 1, and the candidate to be merged to Cluster 1 is marked]

2. Cluster formation-2

[Figure (v), formation of cluster 1]

Check thresholds: OK. d1 = 1/1 = 1 > 0.6 (din), cpC1 = 1/(1*1) = 1 > 0.4 (cpin)

[Figure (vi), formation of cluster 1]

cpA1 = 2/(1*2) = 1 > 0.4 (cpin)

Cluster 1: d1 = 3/3 = 1


[Figure (vii), formation of cluster 1]

Check thresholds: OK. d1 = 6/6 = 1 > 0.6 (din), cpB1 = 3/(1*3) = 1 > 0.4 (cpin)

[Figure (viii), formation of cluster 1]

Check thresholds: OK. d1 = 8/10 = 0.8 > 0.6 (din), cpL1 = 2/(1*4) = 0.5 > 0.4 (cpin)

[Figure (ix), search for cluster 1]

Check thresholds: OK. d1 = 10/15 = 0.67 > 0.6 (din), cpE1 = 2/(0.8*5) = 0.5 > 0.4 (cpin)



[Figure: a further neighbor is tested against Cluster 1]

Check thresholds: Out. d1 = 11/21 = 0.52 < 0.6 (din), cpE1=1/(0.52*6)=0.32 < 0.4 (cpin)

[Figure: the next neighbor is tested against Cluster 1]

cpF1=1/(0.52*6)=0.32 < 0.4 (cpin)

[Figure: the next neighbor is tested against Cluster 1]

cpF1=1/(0.52*6)=0.0 < 0.4 (cpin)

2. Cluster formation-9: remove the edges and nodes belonging to Cluster 1

[Figure (x), cluster 1 removed: the remaining nodes are F, G, H, I, J and K]

Results of Density Periphery Clustering

[Figure (x), end: the example graph (nodes A to L) with the three resulting clusters]

Cluster 1: d1 = 10/15 = 0.67

Cluster 2: d2 = 3/3 = 1

Cluster 3: d3 = 3/3 = 1

Results: Complexes in the E. coli PPI Network

The network of E. coli proteins consists of 363 interactions involving a total of 336 proteins

DIP:339N GroEL DIP:1081N PrnP

DIP:1025N CarB DIP:1026N CarA

DIP:539N MalG DIP:508N MalE

DIP:124N XerD DIP:726N XerC

DIP:367N PntB DIP:366N PntA

DIP:342N SbcC DIP:572N Gam

-------------- --------- -------------- ---------

-------------- --------- -------------- ---------

http://dip.mbi.ucla.edu/

Components of RNA polymerase (RpoA, RpoB, RpoC, Rsd, RpoZ, RpoD, RpoN, FliA)

Components of ATP synthetase (AtpA, AtpB, AtpE, AtpF, AtpG, AtpH, AtpL)

Proteins involved in cell division (FtsQ, FtsI, FtsW, FtsN, FtsK and FtsL)

Components of DNA polymerase (DnaX, HolA, HolB, HolD and HolC)


Results: Complexes in the S. cerevisiae PPI Network

We extracted a set of 12487 unique binary interactions involving 4648 proteins by discarding the self-interactions in the PPI data obtained from ftp://ftpmips.gsf.de/yeast/PPI/.

Results: Details of a Group of Predicted Complexes

Information on the complexes of size ≥ 6 in the set generated using din=0.7, cpin=0.50 and non-overlapping mode.

Table columns: ID, N, d, Function Class, Gene Name, Corrected P-value. The Function Class column is a chart over the functional classes (axis ticks 1, 5, 10, 15).

ID   Gene Name
1    CTF4,CTF8,CTF18,CTF19,CIN1,CIN2,CIN8,GIM3,GIM4,GIM5,MAD1,MAD2,MAD3,BUB1,BUB3,PAC2,PAC10,ARP6,BIK1,BIM1,CHL1,CSM3,DCC1,HTZ1,KAR3,SCC1-73,TUB3,YKE2
2    CHS3,CHS5,CHS7,BNI1,BNI4,RVS161,RVS167,ARC40,ARP2,BCK1,CLA4,FKS1,KRE1,SKT5,SLT2,SMI1,SWI4
3    TAF17,TAF25,TAF60,TAF61,TAF90,SPT3,SPT7,SPT8,SPT20,ADA2,GCN5,HFI1,NGG1,TRA1
4    LSM1,LSM2,LSM3,LSM4,LSM5,LSM6,LSM7,LSM8,DCP1,KEM1,MRNa,PAT1,SNRNa,U6
5    RAD27,RAD50,CDC45-1,ELG1,ESC2,HPR5,MMS4,MRC1,POL32,RRM3,SGS1,TOF1,TOP3
6    TRS20,TRS23,TRS31,TRS33,TRS65,TRS85,TRS120,TRS130,BET3,BET5,GSG1,KRE11
7    COG5,COG6,COG7,COG8,ARL1,ARL3,GOS1,GYP1,RIC1,SWF1,TLG2,YPT6
8    APC1,APC2,APC4,APC5,APC9,APC11,CDC16,CDC23,CDC26,CDC27,DOC1
9    CDC73,CTI6,DEP1,LEO1,SAP30,SET2,SIF2,SWR1,VPS71
10   CFT1,CFT2,FIP1,PAP1,PFS2,PTA1,YSH1,YTH1
11   MED2,MED4,MED7,MED8,PGD1,RPB3,SOH1,SRB4
12   BEM1,BEM2,BOI1,BOI2,CDC24,CDC42,MSB1,STE20
13   ARP1,ASE1,CLB4,JNM1,KAR9,KIP3,NIP100,PAC11
14   CDC4,CDC34,CDC53,CLN1,CLN2,CLN3,SIC1,SKP1
15   CDC3,CDC10,CDC11,CDC12,GIN4,SEP7,SHS1
16   CKA1,CKA2,CKB1,CKB2,CDC7-1,RHO3,TOP2
17   SNR3,SNR10,SNR11,SNR189,GAR1,NHP2,NOP10
18   SPC19,SPC24,NNF1,NUF2,SMC1,TID3,YDR295c
19   YGL161c,YGL198w,GCS1,YDR425w,YIP1,YPL095c
20   PRP5,PRP9,PRP11,PRP21,NOG2,YNR053c
21   NUP49,NUP57,APG17,NIC96,NSP1,SEC35
22   KTR3,LAS17,SLA1,YFR024c,YOR284w,YSC84
23   ECM31,GCD7,NIP29,TEM1,YJL199c,YPL070w
24   ERB1,HAS1,NIP7,NOP7,NUG1,SSF1
25   SEC2,SEC4,SEC10,SEC15,MYO2,SMY1
26   MYO3,MYO5,BBC1,BZZ1,UBP7,VRP1
27   DBF2,DBF20,CDC15,LTE1,MOB1,SPO12
28   HHF1,HHF2,HHT1,HHT2,SPT6,STH1
29   CBF1,CEP3,CHL4,CTF13,MCM21,MIF2

N (column values): 28, 17, 13, 14, 14, 12, 12, 11, 9, 8, 8, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6

d (column values): 0.71, 0.72, 1.00, 0.83, 0.71, 0.94, 0.71, 0.98, 0.72, 0.93, 0.72, 0.71, 0.71, 0.71, 0.95, 0.76, 0.71, 0.71, 0.80, 0.80, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73

Corrected P-value (column values): 3.9x10^-17, 9.0x10^-13, 1.7x10^-11, 1.1x10^-6, 3.7x10^-4, 3.4x10^-11, 4.0x10^-6, 2.1x10^-10, 1.9x10^-5, 4.8x10^-7, 3.4x10^-5, 3.1x10^-9, 4.5x10^-7, 6.8x10^-7, 3.5x10^-6, 5.4x10^-3, 1.3x10^-4, 3.5x10^-6, 9.5x10^-4, 1.3x10^-7, 6.3x10^-10, 1.0x10^-4, 4.8x10^-1, 2.3x10^-3, 2.4x10^-5, 1.0x10^-4, 1.2x10^-3, 1.8x10^-5, 2.3x10^-5

[Figure, panels (a) and (b): a complex with the nodes YIP1, GCS1, YGL161c, YPL095c, YGL198w and YDR425w]

We considered 15 functional classes: (1) Cell cycle and DNA processing, (2) Protein with binding function or cofactor requirement (structural or catalytic), (3) Protein fate (folding, modification, destination), (4) Biogenesis of cellular components, (5) Cellular transport, transport facilitation and transport routes, (6) Metabolism, (7) Interaction with the cellular environment, (8) Transcription, (9) Energy, (10) Cell rescue, defense and virulence, (11) Cell type differentiation, (12) Cellular communication/signal transduction mechanism, (13) Protein activity regulation, (14) Protein synthesis, and (15) Transposable elements, viral and plasmid proteins

Results: Hypergeometric distribution

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}}$$

N= Total number of proteins in the network

F= Number of proteins of a functional group in the network

C= Number of proteins in a cluster

k= Number of proteins of a functional group in a cluster

The p-value of a cluster is the probability that a randomly selected set of C proteins contains at least k proteins of the functional group.

The lower the p-value the higher the statistical significance

3 green and 4 red balls

Put them in a box

Randomly choose any 3

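P0 (# of red balls is 0) = $\binom{4}{0}\binom{3}{3} / \binom{7}{3} = 1/35$

P1 (# of red balls is 1) = $\binom{4}{1}\binom{3}{2} / \binom{7}{3} = 12/35$

P2 (# of red balls is 2) = $\binom{4}{2}\binom{3}{1} / \binom{7}{3} = 18/35$

P3 (# of red balls is 3) = $\binom{4}{3}\binom{3}{0} / \binom{7}{3} = 4/35$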

Notice that, P0 +P1+P2+P3=1

P-value & Hypergeometric distribution

[Figure: bar chart of the probabilities P0 = 1/35, P1 = 12/35, P2 = 18/35 and P3 = 4/35 against the number of red balls (0 to 3); the vertical axis runs from 0 to 0.6]



P(# of red balls ≤ 1) = P0 + P1

P(# of red balls ≥ 2) = 1 - (P0 + P1)

P(# of red balls ≥ k) = 1 - (P0 + P1 + … + Pk-1)

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}}$$

For the ball example, N=7, F=4, C=3.
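The ball example and the general p-value formula can be checked with exact fractions. This is a minimal sketch using only the Python standard library; the function names are chosen here for illustration.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(i, N, F, C):
    """Probability of drawing exactly i marked items in C draws,
    from N items of which F are marked (without replacement)."""
    return Fraction(comb(F, i) * comb(N - F, C - i), comb(N, C))


def p_value(k, N, F, C):
    """P(at least k marked items) = 1 - (P0 + P1 + ... + P(k-1))."""
    return 1 - sum(hypergeom_pmf(i, N, F, C) for i in range(k))


# 3 green and 4 red balls, choose 3: N=7, F=4 (red), C=3.
probabilities = [hypergeom_pmf(i, 7, 4, 3) for i in range(4)]
print(probabilities)
print(p_value(2, 7, 4, 3))
```

For a predicted complex, N would be the number of proteins in the network, F the size of the functional group, C the size of the cluster and k the number of cluster proteins in that group.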


Results: Details of a Group of Predicted Complexes

Information on the complexes of size ≥ 6 in the set generated using din=0.7, cpin=0.50 and non-overlapping mode.

Protein YDR425w of complex 19 is related to cellular transport, and YIP1, YGL198w, YGL161c and GCS1 are related to vesicular transport. Hence, we predict that the function-unknown protein YPL095c of this complex is a transport related protein, most likely related to vesicular transport.


Conclusions

•In this work, we present an algorithm to detect locally dense regions in undirected simple graphs.

•The algorithm can be used to detect protein complexes in large protein-protein interaction networks or co-expressed gene clusters based on microarray data.

•It can also be used for protein/gene function prediction by way of finding complexes/clusters in networks consisting of function known and function unknown proteins.

•Also, DPClus can be applied to other networks where finding cohesive groups is of interest.

The DPClus software is available at http://kanaya.naist.jp/DPClus/

Md. Altaf-Ul-Amin, Hisashi Tsuji, Ken Kurokawa, Hiroko Asahi, Yoko Shinbo, Shigehiko Kanaya, “DPClus: A Density-periphery Based Graph Clustering Software Mainly Focused on Detection of Protein Complexes in Interaction Networks”, Journal of Computer Aided Chemistry , Vol.7, 150-156, 2006.

2. The DPClus Software


The DPClus software has been developed based on the proposed algorithm.

The main window of DPClus

The DPClus Software

The input file format

List of edges

AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH

Adjacency matrix

0 0 1 0 1
0 0 0 1 1
1 0 0 0 1
0 1 0 0 1
1 1 1 1 0

Adjacency list

AtpA AtpB, AtpH
AtpB AtpA, AtpH
AtpH AtpB, AtpA, AtpG, AtpE
AtpG AtpH, AtpE
AtpE AtpG

[Figure: the corresponding network]
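The three representations describe the same network, and an edge list is easy to convert into the other two. The sketch below is an illustration only and is not the DPClus file reader; the whitespace-separated pair format and the function name `read_edge_list` are assumptions made here.

```python
EDGE_LIST = """AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH"""


def read_edge_list(text):
    """Build an adjacency list (node -> set of neighbors) from edge pairs."""
    adjacency = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        u, v = line.split()
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


adjacency = read_edge_list(EDGE_LIST)
nodes = sorted(adjacency)

# Adjacency matrix in the (alphabetical) order of `nodes`.
matrix = [[1 if v in adjacency[u] else 0 for v in nodes] for u in nodes]

for node, row in zip(nodes, matrix):
    print(node, row)
```

The row and column order of the matrix depends on the node order chosen, so the printed matrix can differ from the slide by a permutation of rows and columns.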

Output file format

Cluster
Length of cluster 1 is: 8
RpoA
RpoB
RpoC
Rsd
RpoZ
RpoD
RpoN
FliA
Cluster
Length of cluster 2 is: 8
AtpH
AtpG
AtpB
AtpA
AtpF
AtpL
AtpE
AtpB(A)
Cluster
Length of cluster 3 is: 5
----------------------------------------------------------------------------

The DPClus Software

Click!

Intra cluster edges are green and inter cluster edges are red

Nodes have been arranged by dragging

The DPClus Software

Click

Click

Click

Hierarchical graph of the clusters

The DPClus Software

Clustering of microarray data

Sample microarray data

To apply DPClus, we need to convert this data to a network.

The DPClus Software

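[Table: sample microarray data; rows are genes and columns are experiment IDs]

Gene-Gene correlation

$$R_{ij} = \frac{\sum_{k=1}^{m}(x_{ik}-\bar{x}_i)(x_{jk}-\bar{x}_j)}{\sqrt{\sum_{k=1}^{m}(x_{ik}-\bar{x}_i)^2 \sum_{k=1}^{m}(x_{jk}-\bar{x}_j)^2}}$$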

Select highly correlated gene pairs

Edges of a Network

At3g10060 At3g54150
At3g10060 At3g63140
At3g10060 At5g07020
-------------- --------------
-------------- --------------
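The conversion from expression data to network edges can be sketched as follows. The gene names and expression values below are made up for illustration, and the function names are assumptions; the correlation is the Pearson correlation over the m experiments, and a gene pair becomes an edge when its correlation reaches the threshold.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sqrt(sum((a - mean_x) ** 2 for a in x)
               * sum((b - mean_y) ** 2 for b in y))
    return num / den if den else 0.0  # constant profiles get correlation 0


def correlation_edges(expression, threshold):
    """Return the gene pairs whose correlation is at least `threshold`."""
    return [(g, h) for g, h in combinations(sorted(expression), 2)
            if pearson(expression[g], expression[h]) >= threshold]


# Hypothetical data: 3 genes measured in 4 experiments.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = correlation_edges(expression, threshold=0.95)
print(edges)
```

The resulting edge list has the same two-column form as the list above and can be clustered like any other network.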

The DPClus Software

# of experiments: 626

Threshold correlation: 0.95

cp value: 0.5

density value: 0.9

Minimum cluster size: 3

The DPClus Software

Ribosomal protein clusters

Electron transport clusters

Photosynthesis clusters

The DPClus Software

Line Graphs

Given a graph G, its line graph L(G) is a graph such that each vertex of L(G) represents an edge of G, and two vertices of L(G) are adjacent if and only if their corresponding edges share a common endpoint ("are adjacent") in G.

Graph G Vertices in L(G) constructed from edges in G

Added edges in L(G)

The line graph L(G)

http://en.wikipedia.org/wiki/Line_graph

http://en.wikipedia.org/wiki/File:Line_graph_construction_1.svg
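The definition of the line graph translates almost word for word into code. The sketch below is a small illustration; the function name `line_graph` and the representation of G as a list of node pairs are assumptions made here.

```python
from itertools import combinations


def line_graph(edges):
    """Return (vertices, adjacent) of L(G) for a simple graph G.

    Each vertex of L(G) is an edge of G; two vertices are adjacent
    when the corresponding edges of G share an endpoint.
    """
    vertices = [tuple(sorted(e)) for e in edges]
    adjacent = [(e, f) for e, f in combinations(vertices, 2)
                if set(e) & set(f)]
    return vertices, adjacent


# A star with center c and three leaves.
star = [("c", "x"), ("c", "y"), ("c", "z")]
star_vertices, star_adjacent = line_graph(star)
print(star_vertices)
print(star_adjacent)

# A path a-b-c-d.
path_vertices, path_adjacent = line_graph([("a", "b"), ("b", "c"), ("c", "d")])
print(path_adjacent)
```

All three edges of the star share the center, so its line graph is a triangle: a star is transformed into a clique.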








Line Graphs

RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs

By John W. Raymond, Eleanor J. Gardiner and Peter Willett

THE COMPUTER JOURNAL, Vol. 45, No. 6, 2002

The above paper has introduced a new graph similarity calculation procedure for comparing labeled graphs.

The chemical graphs G1 and G2 are shown in Figure a,and their respective line graphs are depicted in Figure b.

Line Graphs

Detection of Functional Modules From Protein Interaction Networks

By Jose B. Pereira-Leal, Anton J. Enright and Christos A. Ouzounis

PROTEINS: Structure, Function, and Bioinformatics 54:49–57 (2004)

Transforming a network of proteins to a network of interactions. a) Schematic representation illustrating a graph representation of protein interactions: nodes correspond to proteins and edges to interactions. b) Schematic representation illustrating the transformation of the protein graph connected by interactions to an interaction graph connected by proteins. Each node represents a binary interaction and edges represent shared proteins. Note that labels that are not shared correspond to terminal nodes in (a)

A star is transformed into a clique
