A Model of the Scholarly Community

Marko A. Rodriguez

http://www.soe.ucsc.edu/~okram

March 30, 2007

MESUR Project

• 2-year project

• First half of the project is focused on ontology development, parsing, and the development of analysis algorithms (metrics).

• Second half of the project is the analysis of our data structure and reporting our findings in the literature.

Outline

• The Data: which and how much data?

• The Model: how do we represent the data?

• The Metrics: how do we quantify the entities in our model?

Terminology

• Groups: journals, proceedings, magazines, newspapers, edited books.

• Units: articles, book chapters, dissertations.

• Documents: groups and units.

• Usage-Event: the act of interacting with an article (e.g. getFullText, getAbstract, getReferences); an expression of user interest.

The Data

• Two types of data:
  – Bibliographic data: metadata pertaining to groups and units.
  – Usage data: metadata pertaining to group and unit usage.

The Data

• Group-level Bibliographic Data
  – SFX Master List: > 300,000 groups
  – SFX: > 85,000 group classifications
  – ISI JCR: > 8,000 indexed groups
  – ISI JCR: > 50,000,000 group citations
  – ISI JCR: > 100,000 group classifications

• Unit-level Bibliographic Data
  – ISI Tapes: > 30,000,000 unit records
  – ISI Tapes: > 500,000,000 unit citations

The Data

• Usage Data
  – Los Alamos: > 400,000 (1 year)
  – BioMed Central: > 24,000,000 (2 years)
  – anonymous: > 1,000,000 (5 years)
  – anonymous: > 2,500,000 (1 year)
  – anonymous: > 50,000,000 (1 week)
  – …

The Data

• The semantic network model is estimated to be > 10 billion triples (edges).
  – As of March 2007: 1.2 billion.

The Model

• In order to integrate the various data sets in their various formats, we model all information according to an ontology.

The Model

• RDF, RDFS, OWL [W3C Standards]
  – Resource Description Framework
  – Resource Description Framework Schema
  – Web Ontology Language

• These provide a standardized language in which to represent our entities and their relationships to one another.

The Model

• In OWL, everything is an owl:Thing--both nodes and edges (analogous to java.lang.Object in Java)

• All owl:Things are represented by a URI.

• An instance of the ontology provides us with a list of URI triples (subject, predicate, object).
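As a rough illustration of what a URI triple list looks like in practice, here is a minimal Python sketch. It is not the MESUR data structure itself: the `http://example.org/` URIs, the `Triple` type, and the `objects` helper are all hypothetical and exist only for this example. The one real identifier is the standard `rdf:type` URI.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One edge of the semantic network: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


# The standard rdf:type predicate URI.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical namespace, used only for this illustration.
EX = "http://example.org/"

triples: List[Triple] = [
    Triple(EX + "article1", RDF_TYPE, EX + "Unit"),
    Triple(EX + "journal1", RDF_TYPE, EX + "Group"),
    Triple(EX + "article1", EX + "containedIn", EX + "journal1"),
]


def objects(store: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object o such that (subject, predicate, o) is in the store."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]
```

Calling `objects(triples, EX + "article1", EX + "containedIn")` follows the single `containedIn` edge from the article to its journal.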

The Model

• The instance of an OWL ontology resides in a triple store.

The Model

• SPARQL (like SQL, but for triple stores).

SELECT ?c as grandparent
WHERE
  ( ?a childOf ?b )
  ( ?b childOf ?c )
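The grandparent query is a join of the `childOf` pattern with itself. The sketch below shows the same join over an in-memory triple list in plain Python. It illustrates the pattern matching only and is not how a triple store evaluates SPARQL. The names in `family` are made-up sample data, and the function returns (grandchild, grandparent) pairs rather than only `?c`, so the binding of `?a` stays visible.

```python
def grandparents(triples):
    """Return the set of (a, c) pairs where a childOf b and b childOf c."""
    # Index the childOf edges: child -> set of parents.
    parents = {}
    for subject, predicate, obj in triples:
        if predicate == "childOf":
            parents.setdefault(subject, set()).add(obj)

    result = set()
    for a, bs in parents.items():
        for b in bs:
            # Second hop of the join: parents of b are grandparents of a.
            for c in parents.get(b, ()):
                result.add((a, c))
    return result


# Made-up sample data for illustration.
family = [
    ("alice", "childOf", "bob"),
    ("bob", "childOf", "carol"),
    ("bob", "childOf", "dave"),
    ("erin", "childOf", "alice"),
]
```

Here `grandparents(family)` pairs `alice` with both `carol` and `dave`, and `erin` with `bob`.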

The Model

Rodriguez, M.A., Bollen, J., Van de Sompel, H., “A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage”, IEEE/ACM Joint Conference on Digital Libraries, Vancouver, 2007.

The Model

SELECT ?x
WHERE
  ( ?x rdf:type mesur:Publishes )
  ( ?x mesur:hasAuthor lanl:marko )
  ( ?x mesur:hasAuthor lanl:herbertv )

INSERT < _123 rdf:type mesur:Coauthor >
INSERT < _123 mesur:hasSource lanl:marko >
INSERT < _123 mesur:hasSink lanl:herbertv >
INSERT < _123 mesur:hasWeight COUNT(?x) >
INSERT < _456 rdf:type mesur:Coauthor >
INSERT < _456 mesur:hasSource lanl:herbertv >
INSERT < _456 mesur:hasSink lanl:marko >
INSERT < _456 mesur:hasWeight COUNT(?x) >

From the Publishes contexts, generate a weighted coauthorship network.
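The same derivation can be sketched outside the triple store. The Python below assumes each Publishes context has already been reduced to a set of author identifiers. The `contexts` dictionary, the `pub1`..`pub3` keys, and the `lanl:jbollen` identifier are hypothetical sample data. Each pair of coauthors receives one directed edge in each direction, weighted by the number of contexts they share, which mirrors the two `mesur:Coauthor` inserts above.

```python
from collections import Counter
from itertools import permutations


def coauthor_network(publishes):
    """Build a weighted, directed coauthorship network.

    publishes: dict mapping a Publishes-context id to its set of authors.
    Returns a Counter mapping (source, sink) to the number of shared contexts.
    """
    weights = Counter()
    for authors in publishes.values():
        # Every ordered pair of distinct authors in one context is one edge.
        for source, sink in permutations(sorted(authors), 2):
            weights[(source, sink)] += 1
    return weights


# Hypothetical sample contexts for illustration.
contexts = {
    "pub1": {"lanl:marko", "lanl:herbertv"},
    "pub2": {"lanl:marko", "lanl:herbertv", "lanl:jbollen"},
    "pub3": {"lanl:jbollen"},
}
network = coauthor_network(contexts)
```

In this sample, `lanl:marko` and `lanl:herbertv` share two contexts, so both directed edges between them carry weight 2. The single-author context `pub3` contributes no edge.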

The Model

Phase 1 looks only at group-level usage and bibliographic data.

The Metrics

• ISI Impact Factor

• Usage Impact Factor
  – Bollen, J., Van de Sompel, H., “Usage Impact Factor: The Effects of Sample Characteristics on Usage-based Impact Metrics”, [in review], 2007.

• H-Index
  – Hirsch, J.E., “An index to quantify an individual's scientific research output”, Proceedings of the National Academy of Sciences, 102:46, 2005.

• Y-Factor
  – Bollen, J., Rodriguez, M.A., Van de Sompel, H., “Journal Status”, Scientometrics, 69:3, 2006.

• …
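Of the metrics listed, the H-Index has the shortest definition: the largest h such that at least h papers have at least h citations each. A minimal sketch follows, assuming the per-paper citation counts have already been extracted into a plain list. The function name and that input format are choices made for this example.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h
```

For example, citation counts of 10, 8, 5, 4 and 3 give an H-Index of 4: four papers have at least four citations, but there are not five papers with at least five.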

The Metrics

SELECT ?x
WHERE
  ( ?x rdf:type mesur:Publishes )
  ( ?x mesur:hasUnit ?a )
  ( ?x mesur:hasGroup ?b )
  ( ?b mesur:partOf urn:issn:1082-9873 )
  ( ?x mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)
  ( ?y rdf:type mesur:Citation )
  ( ?y mesur:hasSource ?c )
  ( ?y mesur:hasSink ?a )
  ( ?z rdf:type mesur:Publishes )
  ( ?z mesur:hasUnit ?c )
  ( ?z mesur:hasTime ?u ) AND ?u = 2007

SELECT ?y
WHERE
  ( ?y rdf:type mesur:Publishes )
  ( ?y mesur:hasGroup ?a )
  ( ?a mesur:partOf urn:issn:1082-9873 )
  ( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)

INSERT < _123 rdf:type mesur:ImpactFactor >
INSERT < _123 mesur:hasObject urn:issn:1082-9873 >
INSERT < _123 mesur:hasStartTime 2007 >
INSERT < _123 mesur:hasEndTime 2007 >
INSERT < _123 mesur:hasNumericValue (COUNT(?x) / COUNT(?y)) >

From the Publishes and Citation contexts, generate Impact Factor Rankings.
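The arithmetic behind those two counts can be sketched in Python. The sketch assumes the Publishes contexts have been flattened to (unit, journal, year) records and the Citation contexts to (citing unit, cited unit) pairs. The record layout, the unit ids, and the `urn:example:other-journal` identifier are hypothetical. The two-year window matches the `?t > 2004 AND ?t < 2007` filter for a 2007 impact factor.

```python
def impact_factor(publishes, citations, journal, year):
    """Citations made in `year` to units the journal published in the two
    preceding years, divided by the number of those units.

    publishes: list of (unit, journal, year) records.
    citations: list of (citing_unit, cited_unit) pairs.
    Returns None when the journal published nothing in the window.
    """
    pub_year = {unit: y for unit, _, y in publishes}
    # Units the journal published in the two years before `year`.
    window = {
        unit
        for unit, j, y in publishes
        if j == journal and year - 2 <= y <= year - 1
    }
    if not window:
        return None
    # Count citations into the window whose citing unit appeared in `year`.
    cites = sum(
        1 for src, dst in citations
        if dst in window and pub_year.get(src) == year
    )
    return cites / len(window)


# Hypothetical sample records for illustration.
records = [
    ("a1", "urn:issn:1082-9873", 2005),
    ("a2", "urn:issn:1082-9873", 2006),
    ("a3", "urn:issn:1082-9873", 2003),
    ("c1", "urn:example:other-journal", 2007),
    ("c2", "urn:example:other-journal", 2007),
    ("c3", "urn:example:other-journal", 2006),
]
cites = [("c1", "a1"), ("c2", "a1"), ("c1", "a2"), ("c3", "a2"), ("c2", "a3")]
```

With this sample, `impact_factor(records, cites, "urn:issn:1082-9873", 2007)` counts three qualifying citations to two units in the window, giving 1.5. The citation from `c3` is excluded because it was made in 2006, and the citation to `a3` is excluded because `a3` falls outside the window.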

The Metrics

• Eigenvector-based global-rank metrics such as PageRank, Eigenvector centrality, Y-Factor, and relative-rank ‘spreading activation’ algorithms can be calculated in a similar fashion.

Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks”, [in review], 2007.

Conclusion

• Thanks for your time…Good life.

http://www.mesur.org

