Data Base and Data Mining Group of Politecnico di Torino
DBMG
Identifying collaborations among researchers: a pattern-based approach
Tokyo, August 11 2017
Elena Baralis, Luca Cagliero, Mohammad Reza Kavoosifar, Paolo Garza
Outline
  Analyzed data
  Addressed problem
  Related works
  The pattern-based approach
  Experimental results
  Conclusions and future work
Analyzed data
  Electronic versions of scientific publications
  Available through Digital Libraries (DL) and online databases, e.g., PubMed, OMIM
  Publishers' digital libraries give limited or free access to
    conference proceedings
    books
    journal papers
Analyzed data
  How to find the scientific publications of major interest?
    Topic-driven searches
    Author-driven searches
    All of the above
Analyzed data
  What are the most relevant publications written by an author?
    Author-driven query
    Publications are ranked by
      number of received citations
      date
      popularity (e.g., number of reads)
Analyzed data
  What are the most relevant publications written by an author on a specific topic?
    Author- and topic-driven query
    The author's publications covering the topic under analysis are selected and ranked
Analyzed data
  What are the most fruitful collaborations among multiple authors?
    No deterministic solution
    Hard to solve using simple queries
      For each topic?
      For each combination of authors?
      How to combine and rank the results?
Addressed problem
  Expected result (automatically inferred from DL data)
    A list of significant topics
    For each topic, the groups of researchers who have produced the most relevant publication records
    Groups of researchers of arbitrary size
    Ranked lists (of both topics and groups)
Related works
  Citation content analysis
    Analyze position in the text and semantics of citations
    E.g., [Zhang et al., JASIST 2013], [Kim et al., BIRNDL 2016]
  Researcher networks
    Profile researchers and compute similarities
    E.g., ArnetMiner [Tang et al., KDD 2008]
  Reviewer assignment
    Assist editors in the peer review of scientific papers
    Given a pool of candidate reviewers, what papers should be assigned to each of them?
    E.g., [Kou et al., SIGMOD 2015], [Kou et al., VLDB 2015]
The pattern-based solution
  Unsupervised data mining approach
    Apply an itemset mining algorithm
    Discover patterns representing the most significant correlations between authors and topics
      The Authors-Topic Patterns (ATP)
    Group and rank ATPs to ease manual exploration
Weighted itemset mining
  Weighted transactional data
    Set of weighted transactions
    Each transaction represents a different publication
    Each transaction consists of a set of items
    Items are either authors or topics
    Transactions are weighted by a relevance weight (e.g., the number of received citations)
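The weighted transactional model described above can be illustrated with a minimal Python sketch. The class name `WeightedTransaction`, the helper `make_transaction`, the `(kind, value)` item encoding, and the toy publications with their citation counts are all assumptions made for illustration; they are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedTransaction:
    """One publication: a set of (kind, value) items plus a relevance weight."""
    items: frozenset   # e.g. {("Author", "Smith L."), ("Topic", "Z")}
    weight: float      # e.g. the number of received citations


def make_transaction(authors, topics, citations):
    """Build a weighted transaction from a publication's authors and topics."""
    items = {("Author", a) for a in authors} | {("Topic", t) for t in topics}
    return WeightedTransaction(frozenset(items), float(citations))


# Hypothetical toy dataset: three publications with invented citation counts.
dataset = [
    make_transaction(["Smith L.", "Rossi M."], ["Z"], 40),
    make_transaction(["Smith L."], ["Z"], 10),
    make_transaction(["Rossi M."], ["Y"], 5),
]
```

Storing items in a `frozenset` keeps each transaction hashable and makes the containment checks used by itemset mining a simple subset test.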
Weighted itemset mining
  Weighted itemsets
    A weighted k-itemset is a set of k items that co-occur in a weighted dataset (e.g., {(Author: Smith L.), (Topic: Z)} is a 2-itemset)
    The traditional support of an itemset in a weighted dataset is its observed frequency of occurrence, i.e., it disregards transaction weights
  Extraction task
    Discover all itemsets whose support is above a given (user-specified) threshold
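As a rough illustration of traditional support and the extraction task, the sketch below counts how often an itemset occurs and keeps the itemsets that reach a threshold. It is a brute-force enumeration suitable only for tiny inputs, not an efficient miner; the function names, the `minsup` parameter, the choice to treat the threshold as inclusive, and the toy data are assumptions.

```python
from itertools import combinations

# Hypothetical toy dataset of (items, weight) pairs; weights are ignored here,
# because traditional support disregards transaction weights.
dataset = [
    (frozenset({("Author", "Smith L."), ("Author", "Rossi M."), ("Topic", "Z")}), 40),
    (frozenset({("Author", "Smith L."), ("Topic", "Z")}), 10),
    (frozenset({("Author", "Rossi M."), ("Topic", "Y")}), 5),
]


def support(itemset, dataset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for items, _weight in dataset if itemset <= items)
    return hits / len(dataset)


def frequent_itemsets(dataset, minsup):
    """Brute force: enumerate every sub-itemset of every transaction and
    keep those whose support reaches `minsup` (inclusive, by assumption)."""
    candidates = set()
    for items, _weight in dataset:
        for k in range(1, len(items) + 1):
            for combo in combinations(sorted(items), k):
                candidates.add(frozenset(combo))
    scored = {c: support(c, dataset) for c in candidates}
    return {c: s for c, s in scored.items() if s >= minsup}


frequent = frequent_itemsets(dataset, minsup=0.5)
```

In this toy dataset the 2-itemset {(Author: Smith L.), (Topic: Z)} occurs in two of the three transactions, so it is kept at `minsup=0.5` regardless of how many citations those publications received.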
Authors-Topic Pattern
  Pattern definition and characteristics
    An ATP is a combination of a set of author items (one or more) and a topic item that co-occur in a weighted dataset, e.g., {(Author: Smith L.), (Topic: Z)}
    The influence of an ATP I in a dataset D is a linear combination of the numbers of citations C(p_j) of the publications p_j associated with transactions in D
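A minimal sketch of the influence measure follows. The slide does not give the coefficients of the linear combination, so the code assumes unit coefficients (a plain sum of citations) and assumes that only transactions containing the whole pattern contribute; both choices, as well as the names `is_atp` and `influence` and the toy data, are hypothetical.

```python
# Hypothetical toy dataset of (items, citations) pairs.
dataset = [
    (frozenset({("Author", "Smith L."), ("Author", "Rossi M."), ("Topic", "Z")}), 40),
    (frozenset({("Author", "Smith L."), ("Topic", "Z")}), 10),
    (frozenset({("Author", "Rossi M."), ("Topic", "Y")}), 5),
]


def is_atp(itemset):
    """True if the itemset has at least one author item and exactly one topic item."""
    kinds = [kind for kind, _value in itemset]
    return kinds.count("Author") >= 1 and kinds.count("Topic") == 1


def influence(atp, dataset, coeff=lambda items: 1.0):
    """Linear combination of citation counts over the transactions that
    contain `atp`. ASSUMPTION: unit coefficients unless `coeff` is overridden."""
    atp = frozenset(atp)
    return sum(coeff(items) * citations
               for items, citations in dataset if atp <= items)


pattern = {("Author", "Smith L."), ("Topic", "Z")}
score = influence(pattern, dataset)  # 40 + 10 under unit coefficients
```

Passing a different `coeff` function is one way to experiment with other weightings without changing the rest of the code.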
ATP mining
  Extraction task
    Extract all ATPs whose influence is above a given threshold mininf
    FP-Growth-like extraction [Cagliero & Garza TKDE 2013]
  ATP clustering and ranking
    ATPs are grouped by topic and length (i.e., the collaboration group size) and ranked by decreasing influence
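The extraction, grouping, and ranking steps above can be sketched end to end on toy data. This is a brute-force enumeration, not the FP-Growth-like algorithm cited on the slide, and it assumes influence is a plain sum of citations with an inclusive `mininf` threshold; the function names and the dataset are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy dataset of (items, citations) pairs.
dataset = [
    (frozenset({("Author", "Smith L."), ("Author", "Rossi M."), ("Topic", "Z")}), 40),
    (frozenset({("Author", "Smith L."), ("Topic", "Z")}), 10),
    (frozenset({("Author", "Rossi M."), ("Topic", "Y")}), 5),
]


def influence(atp, dataset):
    """ASSUMPTION: sum of citations of the transactions containing `atp`."""
    return sum(c for items, c in dataset if atp <= items)


def mine_atps(dataset, mininf):
    """Enumerate every (author group, topic) candidate and keep the influential ones."""
    candidates = set()
    for items, _citations in dataset:
        authors = sorted(i for i in items if i[0] == "Author")
        topics = [i for i in items if i[0] == "Topic"]
        for topic in topics:
            for k in range(1, len(authors) + 1):
                for group in combinations(authors, k):
                    candidates.add(frozenset(group) | {topic})
    scored = {c: influence(c, dataset) for c in candidates}
    return {c: s for c, s in scored.items() if s >= mininf}


def group_and_rank(atps):
    """Group ATPs by (topic, group size) and sort each group by decreasing influence."""
    groups = defaultdict(list)
    for atp, score in atps.items():
        topic = next(value for kind, value in atp if kind == "Topic")
        authors = sorted(value for kind, value in atp if kind == "Author")
        groups[(topic, len(authors))].append((score, authors))
    for key in groups:
        groups[key].sort(key=lambda pair: pair[0], reverse=True)
    return dict(groups)


ranked = group_and_rank(mine_atps(dataset, mininf=10))
```

Keying the groups by `(topic, group size)` mirrors the slide's grouping by topic and length, so a reader can browse, for each topic, the most influential single authors, pairs, and larger teams separately.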
Case study
  Real context
    Discovery of research collaborations that have conducted influential studies on genomics and genetics
    Data acquired from the open Online Mendelian Inheritance in Man (OMIM) Digital Library
      Part of the National Center for Biotechnology Information (NCBI) system of databases
    For each genetic disorder
      The list of related publications
      The authors of each publication
      The set of genes correlated with the disorder
Case study
  Pattern validation
    For each gene and genetic disorder pick the top-5 ATPs
    Research question A: Are the research team and the topic really correlated with each other?
      Comparison with top-3 publications returned by author-driven queries on PubMed
    Research question B: Among the topics addressed by the team, is the topic indicated in the pattern the most influential one?
      Comparison with top-ranked publication according to PubMed search
Conclusions and future work
  Summarizing…
    Knowledge discovery from DL data
    Pattern-based solution to identify fruitful collaborations between researchers
    New type of interpretable pattern modelling correlation between authors and topics
    Promising results on data related to genomics/genetics
  Ongoing work
    Integration of more advanced topic detection algorithms
    Differentiate authors' contributions based on their position in the author list
    Application to the reviewer assignment problem