Link Prediction in Co-Authorship Network

Le Nhat Minh (A0074403N)

Supervisor: Dongyuan Lu

Introduction

• Link prediction
  • Predict future connections within the scope of the network
• Co-authorship network
  • A network of collaborations among researchers, scientists, and academic writers

Introduction

• Potential applications
  • Recommend experts or groups of researchers to an individual researcher.

Outline

• Problem Background

• Related Work

• Workflow

• Conclusion

• Result Analysis

• Research plan

Problem Background

• What connects researchers?
• Given an instance of a co-authorship network:
  • A researcher is connected to another if they have collaborated on at least one paper.

Problem

Background

Related

Work

Workflow

Conclusion

[Figure: example co-authorship graph with nodes X and Y and year labels 2001 and 2004]

Problem Background

• How to predict the link?
  • Based on criteria:
    • Co-authorship network topology
    • Researcher’s personal information
    • Researcher’s papers
• Boost link prediction performance
  • A recommended link should be relevant to the authors' interests, or at least represent a feasible collaboration.


Related Work

• Link prediction problems in social networks
  • Liben‐Nowell, D., & Kleinberg, J., 2007
  • Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S., 2013
• In social networks, interactions among users are very dynamic, with:
  • Creation of new links within a few days
  • Deletion or replacement of existing links
• The two networks present different features:
  • Characteristics of an individual researcher: citations, affiliations, institutions, ...
  • Characteristics of a person: marital status, age, workplace, …


• Three mainstream approaches for link prediction:

• Similarity based estimation

• Liben‐Nowell, D., & Kleinberg, J., 2007

• Maximum likelihood estimation

• Murata, T., & Moriyasu, S., 2008

• Guimerà, R., & Sales-Pardo, M., 2009

• Supervised Learning model

• Pavlov, M., & Ichise, R., 2007

• Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006


Similarity Based Estimation

• Use metrics to estimate the proximity of each pair of researchers
• Rank the pairs of researchers by those proximities
• The top-ranked pairs are the likely recommendations.


Similarity Based Estimation

• Network structure based measurement

Some conventions:

$S_{XY}$: similarity between node X and Y
$\Gamma(X)$: set of neighbours of X
$\Gamma(Y)$: set of neighbours of Y
$k(X) = |\Gamma(X)|$: degree of node X
$k(Y) = |\Gamma(Y)|$: degree of node Y


Similarity Based Estimation

• Common Neighbor:

$S_{XY} = |\Gamma(X) \cap \Gamma(Y)|$
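A minimal sketch of the common-neighbour score, assuming the graph is stored as a dictionary that maps each author to the set of their co-authors. The toy graph and the function name `common_neighbors` are illustrative, not taken from the slides.

```python
# Toy co-authorship graph (hypothetical data): author -> set of co-authors.
graph = {
    "X": {"A", "B", "C"},
    "Y": {"B", "C", "D"},
    "A": {"X"},
    "B": {"X", "Y"},
    "C": {"X", "Y"},
    "D": {"Y"},
}


def common_neighbors(graph, x, y):
    """Return S_XY = |Gamma(X) & Gamma(Y)|, the number of shared co-authors."""
    return len(graph[x] & graph[y])


# X and Y share the co-authors B and C.
score_xy = common_neighbors(graph, "X", "Y")
```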


Similarity Based Estimation

• Jaccard’s coefficient:

$S_{XY} = \frac{|\Gamma(X) \cap \Gamma(Y)|}{|\Gamma(X) \cup \Gamma(Y)|}$
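As an illustration, Jaccard's coefficient can be computed directly on two neighbour sets. The function name and the sample sets are hypothetical; returning 0 when both sets are empty is an assumption, since the ratio is undefined in that case.

```python
def jaccard(neighbors_x, neighbors_y):
    """Return |Gamma(X) & Gamma(Y)| / |Gamma(X) | Gamma(Y)|."""
    union = neighbors_x | neighbors_y
    if not union:
        # Assumption: two isolated nodes get a similarity of 0.
        return 0.0
    return len(neighbors_x & neighbors_y) / len(union)


# Hypothetical neighbour sets: 2 shared co-authors out of 4 distinct ones.
score = jaccard({"A", "B", "C"}, {"B", "C", "D"})
```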


Similarity Based Estimation

• Preferential Attachment:

$S_{XY} = k(X) \times k(Y)$
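A short sketch of the preferential-attachment score on a toy graph. The dictionary-of-sets representation and all names are assumptions made for the example.

```python
# Toy co-authorship graph (hypothetical data): author -> set of co-authors.
graph = {
    "X": {"A", "B", "C"},
    "Y": {"B", "D"},
    "A": {"X"},
    "B": {"X", "Y"},
    "C": {"X"},
    "D": {"Y"},
}


def preferential_attachment(graph, x, y):
    """Return S_XY = k(X) * k(Y), the product of the two node degrees."""
    return len(graph[x]) * len(graph[y])


# k(X) = 3 and k(Y) = 2.
score_xy = preferential_attachment(graph, "X", "Y")
```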


Similarity Based Estimation

• Adamic/Adar:

$S_{XY} = \sum_{Z \in \Gamma(X) \cap \Gamma(Y)} \frac{1}{\log k(Z)}$
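The Adamic/Adar sum can be sketched as follows, assuming a natural logarithm and a dictionary-of-sets graph. The toy graph and names are illustrative. A shared neighbour of two distinct nodes always has degree at least 2, so the guard against log(1) = 0 is only defensive.

```python
import math

# Toy co-authorship graph (hypothetical data): author -> set of co-authors.
graph = {
    "X": {"B", "C"},
    "Y": {"B", "C"},
    "B": {"X", "Y"},
    "C": {"X", "Y", "E"},
    "E": {"C"},
}


def adamic_adar(graph, x, y):
    """Sum 1 / log k(Z) over the common neighbours Z of x and y."""
    score = 0.0
    for z in graph[x] & graph[y]:
        degree = len(graph[z])
        if degree > 1:  # skip log(1) = 0 to avoid dividing by zero
            score += 1.0 / math.log(degree)
    return score


# B has degree 2 and C has degree 3, so the score is 1/log 2 + 1/log 3.
score_xy = adamic_adar(graph, "X", "Y")
```

Rarely connected common neighbours contribute more than highly connected ones, which is the point of the 1 / log k(Z) weighting.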


Similarity Based Estimation

• Shortest Path:
  • The minimum number of edges connecting two nodes.
• PageRank:
  • A random walk on the graph assigns each node the probability that it is reached. The proximity between a pair of nodes can be determined by the sum of the two nodes' PageRank scores.
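Both global measures can be sketched in a few lines. The breadth-first search and the power iteration below are standard textbook versions, not necessarily what was used in the experiments; the damping factor of 0.85, the iteration count, and the toy path graph are assumptions.

```python
from collections import deque

# Toy path graph A - B - C - D (hypothetical data): node -> set of neighbours.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration; assumes every node has at least one neighbour."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        rank = {
            node: (1 - damping) / n
            + damping * sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return rank


distance_ad = shortest_path_length(graph, "A", "D")
ranks = pagerank(graph)
# Pair proximity as described above: the sum of the two PageRank scores.
proximity_ad = ranks["A"] + ranks["D"]
```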


Maximum Likelihood Estimation

• Predefine specific rules of a network
• Requires prior knowledge of the network
• The likelihood of any non-connected link is calculated according to those rules.


Supervised Learning Model

• Construct multi-dimensional feature vectors
• Feed these vectors to classifiers to optimize a target function (training the model)
• Link prediction becomes a binary classification problem
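To make the binary-classification view concrete, here is a minimal sketch that trains a single-layer perceptron on feature vectors of candidate pairs. The perceptron is a simple stand-in chosen for brevity, not one of the classifiers evaluated in this work, and the feature values (common neighbours, Jaccard) and labels are invented for the example.

```python
def train_perceptron(features, labels, epochs=1000):
    """Train on labels +1 (link) and -1 (no link); stops once an epoch has no mistakes."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(features, labels):
            activation = bias + sum(w * v for w, v in zip(weights, x))
            if y * activation <= 0:  # misclassified: move the boundary toward x
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def predict(weights, bias, x):
    """Return +1 if the pair is predicted to form a link, else -1."""
    return 1 if bias + sum(w * v for w, v in zip(weights, x)) > 0 else -1


# Hypothetical feature vectors: [common neighbours, Jaccard coefficient].
features = [[3, 0.6], [2, 0.5], [0, 0.0], [1, 0.1]]
labels = [1, 1, -1, -1]
weights, bias = train_perceptron(features, labels)
predictions = [predict(weights, bias, x) for x in features]
```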


Supervised Learning Model

• Related work (Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006) using:
  • Decision Tree
  • SVM (Linear Kernel)
  • K nearest neighbor
  • Multilayer Perceptron
  • Naive Bayes
  • Bagging
• Combine many classifiers (Pavlov, M., & Ichise, R., 2007)
  • Decision stump + AdaBoost
  • Decision Tree + AdaBoost
  • SMO + AdaBoost


Summary

• Similarity based estimation
  • Does not perform well
• Maximum likelihood
  • Depends on the network
• Supervised learning model
  • Performs better than similarity based estimation


Workflow

[Figure: workflow diagram with boxes labelled Classifier, Model, and Features]

Graph Description

• Co-authorship graph:
  • Undirected graph G(V, E)
  • Node or vertex (author)
    • Author ID
    • Author name
  • Link or edge (co-authorship)
    • Pair of author IDs
    • List of publication years, each followed by the paper title (e.g. 2004: ”Introduction to …”)


Setting up data

• The dataset is split into two time spans: 2000 – 2010 and 2010 – 2013
• The first is for training, the latter for testing.
• Currently, there are 134,307 researchers in the 2000 – 2013 network.
• After removing authors who do not appear in the testing period, 104,265 researchers remain.
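The split can be sketched as below. The record layout (author, author, year) and the sample records are hypothetical, and treating 2010 as the last training year is an assumption, since the two stated spans share that year.

```python
# Hypothetical co-authorship records: (author_a, author_b, publication_year).
papers = [
    ("a", "b", 2003),
    ("b", "c", 2009),
    ("d", "e", 2005),
    ("a", "c", 2012),
    ("a", "b", 2011),
]

SPLIT_YEAR = 2010  # assumption: years up to and including 2010 are training years

train = [p for p in papers if p[2] <= SPLIT_YEAR]
test = [p for p in papers if p[2] > SPLIT_YEAR]

# Keep only training links whose two authors both appear in the testing period.
test_authors = {author for a, b, _ in test for author in (a, b)}
train_kept = [(a, b, year) for a, b, year in train
              if a in test_authors and b in test_authors]
```

Here authors d and e never publish after the split year, so their 2005 link is dropped from the training network.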


Setting up data

• Choose a subset of the 104,265 researchers
• Experiment on 937 researchers

                        2000-2010    2010-2013
Real Network
  No. of nodes          104,265      104,265
  No. of links          413,691      35,558
Experiment Network
  No. of nodes          937          937
  No. of links          3,093        57


Baseline Features

• Extract features from the network structure:

• Local similarity

• Common Neighbor

• Adamic / Adar

• Preferential Attachment

• Jaccard’s coefficient

• Global similarity

• Shortest Path

• PageRank


Baseline Features

• Feature for co-authorship network
  • Keyword matching (Cohen, S., & Ebel, L., 2013)
    A suggested metric that measures textual relevancy using a TF-IDF based function.
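One way to picture a TF-IDF based relevancy score is the cosine similarity between TF-IDF vectors built from each author's keywords. This is a generic sketch, not the specific function of Cohen and Ebel; the keyword lists, the weighting tf × log(N / df), and all names are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: author -> list of keywords. Returns author -> {term: tf-idf weight}."""
    n = len(docs)
    # Document frequency: in how many authors' keyword lists each term occurs.
    df = Counter(term for words in docs.values() for term in set(words))
    vectors = {}
    for author, words in docs.items():
        tf = Counter(words)
        vectors[author] = {
            term: (count / len(words)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical keywords extracted from each author's paper titles.
docs = {
    "A": ["graph", "mining", "graph"],
    "B": ["graph", "learning"],
    "C": ["biology", "protein"],
}
vectors = tfidf_vectors(docs)
similarity_ab = cosine(vectors["A"], vectors["B"])
similarity_ac = cosine(vectors["A"], vectors["C"])
```

Authors A and B share the keyword "graph" and get a positive score, while A and C share nothing and score 0.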


Proposed Features

Productivity of the authors
Observe the “history” of an author
For example, at a particular node A:

i      0      1      2      3
T_i    2000   2004   2005   2006
n      3      4      6      7
m      1      2      2      3

n: No. of shared papers
m: No. of collaborators

Changes between consecutive time points:
T0 to T1: Δn = 1, Δm = 1
T1 to T2: Δn = 2, Δm = 0
T2 to T3: Δn = 1, Δm = 1
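The per-period changes in the example can be recomputed from the timeline. This sketch only derives the differences in n and m; it does not implement the productivity score itself, and the tuple layout is an assumption.

```python
# History of node A from the example: (T_i, n, m).
history = [(2000, 3, 1), (2004, 4, 2), (2005, 6, 2), (2006, 7, 3)]

# Differences between consecutive time points: (delta_n, delta_m).
deltas = [
    (n_next - n_prev, m_next - m_prev)
    for (_, n_prev, m_prev), (_, n_next, m_next) in zip(history, history[1:])
]
```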

Proposed Features

Productivity of the authors
Observe the “history” of an author
The “productivity” of node A:

[Equation: P(A), a sum over the time periods i of terms in n_{T_{i+1}} − n_{T_i} and m_{T_{i+1}} − m_{T_i}, weighted by α]

α: a constant to assign the weight of each time period

Training set

• Set up training data
  • With n nodes, there are $\frac{n(n-1)}{2}$ possible links.
  • Among those, separate two kinds of links:
    • Positive links: links that appear in the training years.
    • Negative links: the remaining links, which do not exist in the training years.
  Note: Avoid biased training by balancing the number of instances with true and false labels.
• Classify all the non-existent links
• Compare with the testing data
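A sketch of how such a training set could be assembled. Balancing by randomly undersampling the negative links is an assumption (only the need for balance is stated), and the function name, seed, and toy data are illustrative.

```python
import random
from itertools import combinations


def build_training_pairs(nodes, train_edges, seed=0):
    """Return (positive pairs, sampled negative pairs, number of possible links)."""
    positives = {frozenset(edge) for edge in train_edges}
    all_pairs = [frozenset(pair) for pair in combinations(sorted(nodes), 2)]
    negatives = [pair for pair in all_pairs if pair not in positives]
    # Assumption: undersample negatives so both classes have the same size.
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return list(positives), sampled, len(all_pairs)


# Toy network (hypothetical data): 4 authors, 2 links in the training years.
nodes = ["a", "b", "c", "d"]
train_edges = [("a", "b"), ("b", "c")]
positives, negatives, total = build_training_pairs(nodes, train_edges)
```

For the experiment network, n = 937 gives 937 × 936 / 2 = 438,516 possible links, of which 3,093 exist in the training years.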


Experimental Results

• Measurement of performance
  • Precision: $P = \frac{26}{26 + 5588} \approx 0.005$
  • Recall: $R = \frac{26}{26 + 31} \approx 0.456$
  • Harmonic mean: $F_1 = \frac{2 \cdot 26}{2 \cdot 26 + 5588 + 31} \approx 0.009$
• New links to predict: 57 links

                       Prediction
                       True Link    False Link
Actual True Link       26           31
Actual False Link      5,588        429,778
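The three measures follow directly from the confusion matrix. The variable names below are illustrative; the counts are the ones reported in the table.

```python
# Counts from the confusion matrix.
true_positives = 26        # new links predicted as links
false_negatives = 31       # new links that were missed
false_positives = 5588     # non-links predicted as links
true_negatives = 429778    # non-links predicted as non-links

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
```

The low precision reflects the class imbalance: only 57 of the 435,423 candidate pairs are real new links.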


Result Analysis

• Possible reasons

• Features

• Small set of data – sampling problem

• Instances of the negative links used for training


Research Plan

• Use a weighted graph with parameters:
  • No. of papers
  • No. of neighbors
  • No. of citations
• Focus on features that specifically target the co-authorship network:
  • Citations
  • Institutions
• Enlarge the experiment dataset size

Thank you


References

• Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211-230.

• Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM’06: Workshop on Link Analysis, Counter-terrorism and Security.

• Liben‐Nowell, D., & Kleinberg, J. (2007). The link‐prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.

• Pavlov, M., & Ichise, R. (2007). Finding Experts by Link Prediction in Co-authorship Networks. FEWS, 290, 42-55.

• Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3), 245-257.

• Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078.

• Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S. (2013). An Evolutionary Algorithm Approach to Link Prediction in Dynamic Social Networks. arXiv preprint arXiv:1304.6257.

• Cohen, S., & Ebel, L. (2013). Recommending collaborators using keywords. In Proceedings of the 22nd international conference on World Wide Web companion 959-962.

• The number of links per year in the training set is greater than in the testing set:
  • In the testing period, only “new” collaborations are considered.
  • Any collaboration between researchers who already have a link is disregarded.

                2000-2010    2010-2013
No. of nodes    937          937
No. of links    3,093        57

Results with different classifiers

Classifier               Precision (%)   Recall (%)   F1 (%)
Decision Tree            0.3             24.6         0.5
SMO                      0.5             45.6         0.9
Bagging                  0.4             28.1         0.7
Naive Bayes              0.2             77.2         0.3
Multilayer Perceptron    0.4             47.3         0.8

Precision: positive predictive value. Recall: hit rate. F1: harmonic mean.

Proposed Feature

• The reason for proposing this feature:
  • Keep track of the researcher's tendency
  • Give a “bonus” to researchers who tend to collaborate with “new” colleagues rather than “old” ones
  • Also give a high score to prolific researchers (based on the number of published papers)

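Stochastic Block Model

• Guimerà, R., & Sales-Pardo, M., 2009

$L(A \mid M) = \prod_{\alpha \le \beta} Q_{\alpha\beta}^{l_{\alpha\beta}} (1 - Q_{\alpha\beta})^{r_{\alpha\beta} - l_{\alpha\beta}}$

$r_{\alpha\beta}$: No. of pairs of nodes such that one node is in $\alpha$ and the other is in $\beta$
$l_{\alpha\beta}$: No. of edges between groups $\alpha$, $\beta$
$Q_{\alpha\beta}$: probability that two nodes in groups $\alpha$, $\beta$ are connected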
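A minimal sketch of the likelihood for one fixed partition. Setting each Q to the observed ratio l / r is an assumption made to keep the example short, and the edge list is hypothetical (only the partition into groups {1, 2, 3} and {4, 5, 6, 7} comes from the slides).

```python
from itertools import combinations


def sbm_likelihood(groups, edges):
    """Likelihood of the observed edges given a partition of the nodes into groups."""
    group_of = {node: g for g, members in enumerate(groups) for node in members}
    edge_set = {frozenset(edge) for edge in edges}
    pairs = {}  # (group, group) -> r, the number of node pairs
    links = {}  # (group, group) -> l, the number of observed edges
    for u, v in combinations(sorted(group_of), 2):
        key = tuple(sorted((group_of[u], group_of[v])))
        pairs[key] = pairs.get(key, 0) + 1
        if frozenset((u, v)) in edge_set:
            links[key] = links.get(key, 0) + 1
    likelihood = 1.0
    for key, r in pairs.items():
        l = links.get(key, 0)
        q = l / r  # assumption: Q is set to the observed ratio l / r
        likelihood *= (q ** l) * ((1 - q) ** (r - l))
    return likelihood


groups = [{1, 2, 3}, {4, 5, 6, 7}]
# Hypothetical edges: 3 of 3 inside the first group, 5 of 6 inside the
# second group, and 2 of 12 between the groups.
edges = [(1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7),
         (3, 4), (2, 5)]
likelihood = sbm_likelihood(groups, edges)
```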

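Stochastic Block Model

[Figure: example network with nodes 1 to 7 and a candidate link between X and Y]

$M = \{\{1, 2, 3\}, \{4, 5, 6, 7\}\}$

[Equation: worked likelihood L for this partition, a product of powers of 1/6 and 5/6]

The reliability of an individual link is:

$R_{xy} = L(A_{xy} = 1 \mid A) = \frac{\int L(A_{xy} = 1 \mid M)\, L(A \mid M)\, p(M)\, dM}{\int L(A \mid M')\, p(M')\, dM'}$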

Date post:	24-Feb-2016