Formal Models for Expert Finding on DBLP Bibliography Data
Presented by: Hongbo Deng
Co-worked with: Irwin King and Michael R. Lyu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
Dec. 16, 2008
ICDM2008
Hongbo Deng, Irwin King and Michael R. Lyu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
ICDM 2008
Introduction
- Traditional information retrieval (example query: Data mining)
- Expert finding task (example query: Data mining)
Outline
- Introduction
- Related work
- Methodology
  - Modeling Expertise
  - Statistical language model
  - Topic-based model
  - Hybrid model
- Experiments
- Conclusions
Introduction
- Expert finding has received increased interest
  - W3C collection in 2005 and 2006 (introduced and used by TREC)
  - CSIRO collection in 2007
- Nearly all of the work has been evaluated on the W3C collection
- We address the expert finding task in a real-world academic field
  - An important practical problem
  - Some special problems and difficulties
Problems
- How to represent the expertise of a researcher?
  - The publications of a researcher
- How to identify experts for a given query?
  - Relevance between a query and publications
  - Publications act as the “bridge” between query and experts
- What dataset can be used?
  - DBLP bibliography (limited information)
  - Use Google Scholar as a data supplement
- How to measure the relevance between a query and docs?
  - Language model, vector space model, etc.
- Should we treat each publication equally?
Our Work
- Our setting: DBLP bibliography and Google Scholar
  - More than 955,000 articles with over 574,000 authors
  - About 20 GB metadata crawled from Google Scholar
- Differs from the W3C setting
  - Covers a wider range of topics
  - Contains many more expert candidates
- Applications
  - Find experts for consultation on a new research field
  - Assign papers to reviewers automatically
  - Recommend panels of reviewers for grant applications
Related Work
- Document model & Candidate model (Balog et al., SIGIR’06 & SIGIR’07)
- Hierarchical language models (Petkova and Croft, ICTAI’06)
- Voting model (Macdonald and Ounis, CIKM’06)
- Author-Persona-Topic model (Mimno and McCallum, KDD’07)
- ……
- These models do not consider the importance of documents.
- They are hard to apply to large-scale expert finding.
Expertise Modeling
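- Expert finding p(ca|q): what is the probability of a candidate ca being an expert given the query topic q?
- Rank candidates ca according to this probability.
- Approach: using Bayes’ theorem,
  $$p(\mathit{ca} \mid q) = \frac{p(\mathit{ca}, q)}{p(q)}$$
  where p(ca, q) is the joint probability of a candidate and a query, and p(q) is the probability of a query.
- Because p(q) is the same for every candidate, ranking by p(ca|q) is equivalent to ranking by p(ca, q).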
Expertise Modeling
- Problem: how to estimate p(ca, q)?
- Model 1: Statistical language model
  - Document-based approach
  - Find the experts from the associated publications
- Model 2: Topic-based model
  - Association between the query and several similar topics
- Model 3: Hybrid model
  - Combination of Model 1 and Model 2
[Figures: graphical models linking q, D and ca (Model 1) and q, T, D and ca (Model 2)]
Basic Language Model
- The probability p_l(ca, q) is estimated with a language model, assuming that q and ca are conditionally independent given a document.
- Find the documents relevant to the query
- Model the knowledge of an expert from the associated documents
[Fig. 1. Baseline model: q, D, ca]
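A minimal sketch of the document-based estimate, assuming the decomposition p_l(ca, q) = Σ_d p(d) · p(q|d) · p(ca|d) that the "conditionally independent" annotation suggests. The smoothing method (Jelinek-Mercer with weight `lam`), the uniform p(ca|d) over co-authors, and all function names and toy data are assumptions of this sketch, not details taken from the slides.

```python
from collections import Counter


def query_likelihood(query_terms, doc_counts, coll_counts, lam=0.5):
    """p(q|d) as a product of smoothed unigram probabilities.

    Jelinek-Mercer smoothing with weight `lam` is an assumption of this sketch.
    """
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    likelihood = 1.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        likelihood *= (1 - lam) * p_doc + lam * p_coll
    return likelihood


def score_candidates(query_terms, docs, authors, prior=None, lam=0.5):
    """Return {candidate: sum over d of p(d) * p(q|d) * p(ca|d)}."""
    doc_counts = {d: Counter(tokens) for d, tokens in docs.items()}
    coll_counts = Counter()
    for counts in doc_counts.values():
        coll_counts.update(counts)
    if prior is None:
        # Basic model: every document is treated as equally important.
        prior = {d: 1.0 / len(docs) for d in docs}
    scores = {}
    for d, counts in doc_counts.items():
        p_q_d = query_likelihood(query_terms, counts, coll_counts, lam)
        p_ca_d = 1.0 / len(authors[d])  # assumed: uniform over co-authors
        for ca in authors[d]:
            scores[ca] = scores.get(ca, 0.0) + prior[d] * p_q_d * p_ca_d
    return scores


# Hypothetical toy data, not taken from DBLP.
docs = {
    "d1": ["expert", "finding", "language", "model"],
    "d2": ["image", "retrieval"],
}
authors = {"d1": ["alice", "bob"], "d2": ["bob"]}
scores = score_candidates(["expert", "finding"], docs, authors)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Passing a non-uniform `prior` is the hook a weighted variant would use; the basic model leaves it uniform.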
Weighted Language Model
[Fig. 2. A query example: query q, documents d1 (cited by 200) and d2 (cited by 10), authors a1 and a2, edge weights 0.1 and 0.2]
[Fig. 3. Weighted model: q, D, ca, with document prior p(d)]
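A small sketch of how a citation-based document prior p(d) could break a tie between two equally relevant documents, in the spirit of the query example (one document cited 200 times, one cited 10 times). The log-damped form ln(e + c) and its normalization are assumptions of this sketch, the slide only indicates that p(d) is no longer uniform. The probabilities below are hypothetical and are not the edge weights from the figure.

```python
import math


def citation_prior(citations):
    """Normalized prior p(d) from citation counts.

    The ln(e + c) damping is an assumed form, chosen so that an uncited
    document still receives a non-zero weight.
    """
    weights = {d: math.log(math.e + c) for d, c in citations.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


def weighted_scores(p_q_given_d, p_ca_given_d, prior):
    """Return {candidate: sum over d of p(d) * p(q|d) * p(ca|d)}."""
    scores = {}
    for d, p_q in p_q_given_d.items():
        for ca, p_ca in p_ca_given_d[d].items():
            scores[ca] = scores.get(ca, 0.0) + prior[d] * p_q * p_ca
    return scores


# Hypothetical numbers: both documents match the query equally well.
citations = {"d1": 200, "d2": 10}
p_q_given_d = {"d1": 0.1, "d2": 0.1}
p_ca_given_d = {"d1": {"a1": 1.0}, "d2": {"a2": 1.0}}

uniform = {"d1": 0.5, "d2": 0.5}
tied = weighted_scores(p_q_given_d, p_ca_given_d, uniform)
weighted = weighted_scores(p_q_given_d, p_ca_given_d, citation_prior(citations))
```

With the uniform prior the two authors tie; with the citation prior the author of the more cited document ranks first.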
Topic-based Model
- Observation: researchers usually describe their expertise as a combination of several topics
- Each candidate is represented as a weighted sum of multiple topics Z
  - Similarity between query and topics
  - z -> as a query, estimate p
[Fig. 4. Topic-based model: Z, q, ca, D]
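The "weighted sum of multiple topics" idea can be sketched as follows. This assumes a score of the form Σ_z w(q, z) · s(ca, z), where w(q, z) is the query-topic similarity normalized over the selected topics and s(ca, z) is the candidate's score when the topic z is itself used as a query. The exact formula is not given on the slide, so the normalization, the function name, and the toy numbers are all hypothetical.

```python
def topic_based_scores(query_topic_sim, candidate_topic_score, top_k=None):
    """Score candidates as a similarity-weighted sum over topics.

    query_topic_sim: {topic: similarity between the query and the topic}
    candidate_topic_score: {topic: {candidate: score with the topic as query}}
    top_k: keep only the K topics most similar to the query (None keeps all)
    """
    topics = sorted(query_topic_sim, key=query_topic_sim.get, reverse=True)
    if top_k is not None:
        topics = topics[:top_k]
    total = sum(query_topic_sim[z] for z in topics)
    scores = {}
    for z in topics:
        weight = query_topic_sim[z] / total  # assumed normalization
        for ca, s in candidate_topic_score.get(z, {}).items():
            scores[ca] = scores.get(ca, 0.0) + weight * s
    return scores


# Hypothetical toy values.
query_topic_sim = {
    "information retrieval": 0.6,
    "language model": 0.3,
    "computer vision": 0.1,
}
candidate_topic_score = {
    "information retrieval": {"alice": 0.5, "bob": 0.1},
    "language model": {"alice": 0.2, "bob": 0.4},
    "computer vision": {"bob": 0.9},
}
scores = topic_based_scores(query_topic_sim, candidate_topic_score, top_k=2)
```

Restricting the sum to the top K topics keeps a weakly related topic ("computer vision" here) from contributing to the ranking.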
Topic-based Model
Topic z: Information retrieval
Represented by θz, built from the records returned by Google Scholar:
  1. Introduction to Modern Information retrieval
  2. Information retrieval
  3. Modern Information retrieval
  5. A language modeling approach to information retrieval
  7. Information filtering and information retrieval
  ……
  99. Cross-language information retrieval
  100. On modeling information retrieval with probabilistic inference
Topic-based Model
- Challenge: which similar topics should be selected?
- T1: Calculate p(q|θz) and select the top K ranked topics
  - Assumes topics are independent
- Ideal similar topics:
  - Include topics from many different subtopics
  - Do not include topics with high redundancy
- Define a conditional probability function to quantify the novelty and penalize the redundancy of a topic
- T2:
- T3:
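The definitions of T2 and T3 are not reproduced here, so the following is only an illustration of the general idea of rewarding novelty and penalizing redundancy. It uses a greedy selection in the style of maximal marginal relevance (MMR), which is a stand-in technique and not the authors' conditional probability function. The Jaccard similarity over topic words, the `trade_off` parameter, and the toy relevance values are all hypothetical.

```python
def jaccard(topic_a, topic_b):
    """Word-overlap similarity between two topic names (assumed measure)."""
    a, b = set(topic_a.split()), set(topic_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0


def select_topics(relevance, similarity, k, trade_off=0.5):
    """Greedily pick up to k topics, trading relevance against redundancy.

    relevance: {topic: relevance to the query, e.g. p(q|theta_z)}
    similarity: function (topic, topic) -> value in [0, 1]
    trade_off: 1.0 ranks by relevance only (similar in spirit to T1)
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:

        def marginal(z):
            redundancy = max((similarity(z, s) for s in selected), default=0.0)
            return trade_off * relevance[z] - (1 - trade_off) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=marginal)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical relevance values.
relevance = {
    "information retrieval": 0.9,
    "modern information retrieval": 0.85,
    "data mining": 0.5,
}
diverse = select_topics(relevance, jaccard, k=2, trade_off=0.5)
by_relevance = select_topics(relevance, jaccard, k=2, trade_off=1.0)
```

With `trade_off=1.0` the two near-duplicate topics are both selected; with `trade_off=0.5` the second pick is the less relevant but non-redundant topic.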
Topic Selection Algorithm
T2:
T3:
Hybrid Model
Combine the advantages of p_l and p_t
Defined as:
Experiments
- DBLP collection limitations
  - No abstracts or index terms
  - Hard to represent the document
- Representation for documents
  - Use Google Scholar for data supplementation
  - Title as query, crawled top 10 returned records
  - Up to 20 GB metadata (HTML pages)
  - The citation number of the publication
[Figure: document representation rep_d built from the DBLP title plus supplementary data (sup) from Google Scholar]
Topic Collection
- 2,498 well-defined topics from eventseer
- Crawl the top 100 returned records from Google Scholar
Benchmark Dataset
A benchmark dataset with 7 topics and expert lists
Evaluation Metrics
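- Precision at rank n (P@n): the fraction of the top n ranked candidates that are relevant
- Mean Average Precision (MAP): the mean, over all queries, of the average precision of each ranking
- Bpref: a score based on the number of non-relevant candidates ranked above the relevant ones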
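The three metrics can be sketched as below. P@n and average precision follow their standard definitions. For bpref, the slide does not say which variant was used, so this sketch assumes one common formulation in which each relevant candidate is penalized by the number of judged non-relevant candidates ranked above it, capped at R and divided by min(R, N). The function names and the toy ranking are hypothetical.

```python
def precision_at_n(ranking, relevant, n):
    """Fraction of the top n ranked candidates that are relevant."""
    return sum(1 for c in ranking[:n] if c in relevant) / n


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant candidate."""
    hits, total = 0, 0.0
    for rank, c in enumerate(ranking, start=1):
        if c in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def bpref(ranking, relevant, nonrelevant):
    """bpref over judged candidates only (assumed variant, see lead-in)."""
    big_r, big_n = len(relevant), len(nonrelevant)
    if big_r == 0:
        return 0.0
    denom = min(big_r, big_n)
    nonrel_seen, total = 0, 0.0
    for c in ranking:
        if c in nonrelevant:
            nonrel_seen += 1
        elif c in relevant:
            if denom == 0:
                total += 1.0  # no judged non-relevant candidates at all
            else:
                total += 1.0 - min(nonrel_seen, big_r) / denom
        # Unjudged candidates are ignored.
    return total / big_r


# Hypothetical ranking: "c" is unjudged, "d" is relevant but never retrieved.
ranking = ["a", "x", "b", "y", "c"]
relevant = {"a", "b", "d"}
nonrelevant = {"x", "y"}
p_at_3 = precision_at_n(ranking, relevant, 3)
ap = average_precision(ranking, relevant)
bp = bpref(ranking, relevant, nonrelevant)
```

Because bpref only looks at judged candidates, it is less sensitive than MAP to incomplete expert lists, which is one reason it is often reported alongside it.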
Preliminary Experiments
- Performed on two corpora using the basic language model (B1)
  - “Title” corpus: only using the title
  - “GS” corpus: the representation of Google Scholar
[Table: evaluation results on two corpora (%)]
- More effective to represent d using Google Scholar
Model 1: Statistical Language Models
[Table: evaluation results of language models]
- Weighted language models B2 and B3 outperform B1
- Important to consider the prior probability
Model 2: Topic-based Models
- Vary the number of topics (K) from 5 to 100
- Results by using different values for K.
- The number of topics will be cut off automatically for T2 & T3
Model 2: Topic-based Models
Comparison of the three topic-based models
Model 3: Hybrid Models
Evaluation results of hybrid model
Hybrid model outperforms the pure language model and topic-based model in most of the metrics
Conclusions and Future Work
- Conclusions
  - Address the expert finding task in a real-world academic field
  - Propose a weighted language model
  - Investigate a topic-based model to interpret the expert finding task
  - Integrate the language model with the topic-based model
  - Demonstrate that the hybrid model achieves the best performance in evaluation results
- Future work
  - Take into account other types of information
  - Refine the results by utilizing social network analysis
Q&A
Thanks!
Comparison to Other Systems
Evaluation results of our language models and the method TS
Example results