SubSift web services and workflows for profiling and comparing scientists and their published works

Post on 11-Apr-2017


Simon Price, Peter Flach, Sebastian Spiegler, Christopher Bailey and Nikki Rogers


Outline of this paper

1. SubSift – submission sifting

2. Background Theory: Vector Space Model

3. SubSift REST API

4. Demonstration Workflows

5. Conclusions


1. SubSift – submission sifting


SubSift

SubSift is a prototype application to support academic peer review.

SubSift matches submitted conference/journal papers to potential peer reviewers based on similarity to published works.

Website: http://subsift.ilrt.bris.ac.uk


SubSift has been used for...15


Contribution of this work

SubSift RESTful web services:
• Open Source software (on Google Code)
• Hosted open web service at University of Bristol

Re-usable workflows for profiling and comparing scientists and their published works.

Tool for constructing, manipulating and publishing document-centric datasets.

Related Work

• SubSift uses techniques more normally associated with Information Retrieval.
• Full text search tools support text matching on large-scale document collections, e.g. Apache Lucene, PostgreSQL, Oracle UltraSearch. These are designed for 1:M matching but can also do Cartesian product M:M matching.
• How SubSift differs:
  • Exposes detailed metadata throughout.
  • Partly a research tool: need to plug in and instrument new algorithms.
  • Fewer licensing restrictions and dependencies for open source.


2. Background Theory: Vector Space Model


Vector Space Model (from Information Retrieval)

Vector Space Model consists of:
• bag-of-words representation
• cosine similarity
• tf-idf weighting

For a query (q), rank the documents (dj) in collection (D) by descending similarity to the query.


Vector Space Model: bag-of-words representation

no. terms in each abstract

no. terms in DBLP author page of each PC member
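As a minimal sketch of the bag-of-words representation: each document is reduced to a mapping from term to occurrence count, discarding word order. The tokenizer, the helper name `bag_of_words` and the sample text are illustrative assumptions, not SubSift's actual preprocessing.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count term occurrences, ignoring word order and letter case."""
    # Hypothetical tokenizer: runs of letters only. A real system would
    # typically also remove stopwords and apply stemming.
    terms = re.findall(r"[a-z]+", text.lower())
    return Counter(terms)


# Made-up abstract text, for illustration only.
abstract = "Mining data streams: data mining for streams of data"
bag = bag_of_words(abstract)
print(bag.most_common(3))
```

A term that never occurs simply has count zero (`bag["graphs"]` returns 0), so each bag can be read as a sparse vector over the whole vocabulary.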


Vector Space Model: cosine similarity
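Cosine similarity is the dot product of two term vectors divided by the product of their Euclidean norms. The sketch below assumes sparse vectors stored as dictionaries and ranks some hypothetical reviewers against one paper. All names and weights are invented for illustration.

```python
import math


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    # Dot product over the terms the two vectors share.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Hypothetical term weights for one paper (the query) and three reviewers.
paper = {"data": 3, "mining": 2, "streams": 2}
reviewers = {
    "reviewer_a": {"data": 1, "mining": 1, "graphs": 4},
    "reviewer_b": {"data": 2, "streams": 3},
    "reviewer_c": {"kernels": 5},
}

# Rank reviewers by descending similarity to the paper.
ranking = sorted(
    reviewers,
    key=lambda name: cosine_similarity(paper, reviewers[name]),
    reverse=True,
)
print(ranking)
```

Because the measure depends only on the angle between vectors, a long author page and a short abstract remain comparable despite very different term counts.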


Vector Space Model: tf-idf weighting
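The tf-idf weighting can be sketched as follows, assuming one common variant: term frequency is the count divided by document length, and inverse document frequency is log(N / df). The slides do not state which variant SubSift uses, so this choice, the function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs: dict) -> dict:
    """Map each document id to a {term: tf-idf weight} vector.

    docs maps a document id to its list of terms.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        vectors[doc_id] = {
            term: (count / len(terms)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return vectors


# Toy collection, for illustration only.
docs = {
    "d1": ["data", "mining", "data"],
    "d2": ["data", "graphs"],
    "d3": ["graphs", "kernels"],
}
weights = tf_idf(docs)
print(weights["d1"])
```

In this toy collection "mining" receives a higher weight in d1 than "data", even though "data" occurs twice there, because "mining" appears in fewer documents and is therefore more discriminating.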


Representational State Transfer (REST)

“RESTful” web services:
• URIs to represent resources
• HTTP POST/GET/PUT/DELETE correspond to the usual Create/Read/Update/Delete (CRUD) operations
• Response formats typically include: XML, JSON, CSV

REST is a design pattern for web services based on HTTP using its familiar URIs, requests, responses, authentication, etc.
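The CRUD-to-HTTP mapping can be illustrated with Python's standard library. The resource URI below is a hypothetical placeholder and not a SubSift endpoint. The requests are built but not sent, so the sketch runs offline.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical resource URI, for illustration only (not a SubSift endpoint).
RESOURCE = "https://api.example.org/documents/42"

# CRUD operation -> HTTP method, following the mapping listed above.
CRUD_TO_HTTP = {
    "create": "POST",
    "read": "GET",
    "update": "PUT",
    "delete": "DELETE",
}


def build_request(operation: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) the HTTP request for a CRUD operation."""
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Accept": "application/json"}
    if body is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        RESOURCE, data=body, headers=headers, method=CRUD_TO_HTTP[operation]
    )


for op in CRUD_TO_HTTP:
    payload = {"title": "demo"} if op in ("create", "update") else None
    req = build_request(op, payload)
    print(f"{op:>6}: {req.get_method()} {req.full_url}")
```

Passing one of these objects to `urllib.request.urlopen` would send it. The `Accept` header is how a client would ask for one of the response formats, here JSON.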


3. SubSift REST API


SubSift System Architecture


SubSift REST API


Profiles


Matches


SubSift – canonical workflow


4. Demonstration Workflows


Workflow 1 – Submission Sifting

Workflow 1 – Web 2.0 Client Implementation


Workflow 1 – Papers is just a list of URLs (e.g. Yahoo! Pipes)


Workflow 2 – Finding an Expert


Finding an expert


Workflow 3 – Visualising Similarity


Clustering staff based on homepage similarity

Dendrogram produced in Matlab from SubSift generated similarity matrix
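The dendrogram on the slide was produced in Matlab. As a rough stand-in, the sketch below performs agglomerative clustering directly on a similarity matrix in plain Python. Average linkage is an assumption (the slides do not say which linkage was used), and the staff names and scores are invented.

```python
def agglomerate(names, sim):
    """Average-linkage agglomerative clustering on a similarity matrix.

    Returns the merge history as (cluster_a, cluster_b, similarity)
    tuples, which is the information a dendrogram draws.
    """
    clusters = [(i,) for i in range(len(names))]
    merges = []

    def linkage(a, b):
        # Mean pairwise similarity between members of the two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the most similar pair of current clusters.
        score, a, b = max(
            (linkage(a, b), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append(
            (tuple(names[i] for i in a), tuple(names[i] for i in b), score)
        )
    return merges


# Hypothetical staff names and similarity scores, for illustration only.
names = ["alice", "bob", "carol", "dave"]
sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
merges = agglomerate(names, sim)
for left, right, score in merges:
    print(left, "+", right, f"at similarity {score:.2f}")
```

Each merge corresponds to one junction in a dendrogram, with the merge similarity determining the height at which the two branches join.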


Precision-recall at different thresholds
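One way to read a precision-recall curve over thresholds: treat every pair whose similarity meets the threshold as a predicted link, then compare the predictions with a set of known relevant pairs. The slides do not state what ground truth was used, so the scores and the `relevant` set below are purely hypothetical.

```python
def precision_recall(scores, relevant, threshold):
    """Precision and recall of the pairs scoring at or above threshold."""
    predicted = {pair for pair, score in scores.items() if score >= threshold}
    hits = len(predicted & relevant)
    # Convention chosen here: precision is 1.0 when nothing is predicted.
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical similarity scores and ground-truth links.
scores = {
    ("a", "b"): 0.8,
    ("a", "c"): 0.4,
    ("b", "c"): 0.3,
    ("c", "d"): 0.7,
}
relevant = {("a", "b"), ("c", "d"), ("b", "c")}

for threshold in (0.2, 0.5, 0.9):
    p, r = precision_recall(scores, relevant, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold keeps only the strongest links, which tends to increase precision at the cost of recall.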


Similarity networks

Diagram created by Graphviz from SubSift generated dot file
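A dot file of this kind can be sketched from a similarity matrix by emitting one undirected edge per pair whose similarity meets a threshold. The function name, threshold and data below are illustrative assumptions. The format of the file SubSift itself generates is not shown in the slides.

```python
def similarity_to_dot(names, sim, threshold):
    """Render pairs with similarity >= threshold as an undirected DOT graph."""
    lines = ["graph similarity {"]
    for name in names:
        lines.append(f'  "{name}";')
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sim[i][j] >= threshold:
                lines.append(
                    f'  "{names[i]}" -- "{names[j]}" [label="{sim[i][j]:.2f}"];'
                )
    lines.append("}")
    return "\n".join(lines)


# Hypothetical names and similarity scores, for illustration only.
names = ["alice", "bob", "carol", "dave"]
sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
dot = similarity_to_dot(names, sim, threshold=0.5)
print(dot)
```

Saving the output as `similarity.dot` and running `dot -Tpng similarity.dot -o similarity.png` would render it with Graphviz. Lowering the threshold adds edges and makes the network denser.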


Connectivity

Diagram created by Graphviz from SubSift generated dot file


Workflow 4 – Profiling Reading Lists


Profiling a research group by its publications

Diagram produced in Wordle using SubSift profile data


Workflow 5 – Ranking News Stories


And finally...

Future Work

• Scaling-up
  • Currently a small-scale web application running on modest hardware.
  • Plans to migrate to a larger-scale HPC application at Bristol.
• ExaMiner project
  • Mining and mapping the University of Bristol’s research landscape.
  • Crawling the University’s web pages to profile and visualise research interests of and similarities between faculty, departments, research groups and researchers.
  • Plans to apply to websites of other Universities.


5. Conclusions


Conclusion

• SubSift Services useful outside of peer review domain
• Workflows for profiling/comparing scientists
  • Promising e-Science and e-Research use cases for profiling and comparing scientists and their published works.
• Tool for constructing, manipulating and publishing document-centric datasets
  • E.g. information retrieval, data mining, pattern analysis research.
  • Publication of datasets in this way supports reproducibility of science.
  • Connects data through Linked Data and the Semantic Web.