A Model of the Scholarly Community
Marko A. Rodriguez
http://www.soe.ucsc.edu/~okram
March 30, 2007
MESUR Project
• 2-year project
• First half of the project is focused on ontology development, parsing, and the development of analysis algorithms (metrics).
• Second half of the project is the analysis of our data structure and reporting our findings in the literature.
Outline
• The Data: which and how much data?
• The Model: how do we represent the data?
• The Metrics: how do we quantify the entities in our model?
Terminology
• Groups: journals, proceedings, magazines, newspapers, edited books.
• Units: articles, book chapters, dissertations.
• Documents: groups and units.
• Usage-Event: the act of interacting with an article (e.g. getFullText, getAbstract, getReferences), an expression of user interest.
The Data
• Two types of data:
– Bibliographic data: metadata pertaining to groups and units.
– Usage data: metadata pertaining to group and unit usage.
The Data
• Group-level Bibliographic Data
– SFX Master List: > 300,000 groups
– SFX: > 85,000 group classifications
– ISI JCR: > 8,000 indexed groups
– ISI JCR: > 50,000,000 group citations
– ISI JCR: > 100,000 group classifications
• Unit-level Bibliographic Data
– ISI Tapes: > 30,000,000 unit records
– ISI Tapes: > 500,000,000 unit citations
The Data
• Usage Data
– Los Alamos: > 400,000 1-year
– BioMed Central: > 24,000,000 2-years
– anonymous: > 1,000,000 5-years
– anonymous: > 2,500,000 1-year
– anonymous: > 50,000,000 1-week
– …
The Data
• The semantic network model is estimated to be > 10 billion triples (edges).
– as of March 2007: 1.2 billion.
The Model
• In order to integrate the various data sets in their various formats, we model all information according to an ontology.
The Model
• RDF, RDFS, OWL [W3C Standards]
– Resource Description Framework
– Resource Description Framework Schema
– Web Ontology Language
• Provides us with a standardized language in which to represent our entities and their relationships to one another.
The Model
• In OWL, everything is an owl:Thing--both nodes and edges (analogous to java.lang.Object in Java)
• All owl:Things are represented by a URI.
• An instance of the ontology provides us with a URI triple list data structure:
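A minimal sketch of what a URI triple list might look like in memory. The `ex:` identifiers and the helper `objects_of` are hypothetical, chosen for illustration only; they are not taken from the MESUR ontology or its data.

```python
# Each triple is (subject, predicate, object); every element is a URI.
# All "ex:" URIs below are made up for illustration.
triples = [
    ("ex:article1", "rdf:type", "ex:Article"),
    ("ex:article1", "ex:containedIn", "ex:journal1"),
    ("ex:journal1", "rdf:type", "ex:Journal"),
]


def objects_of(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the list."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects_of(triples, "ex:article1", "rdf:type"))
```

Because nodes and edge labels are all URIs, data from different sources can be merged simply by concatenating their triple lists.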
The Model
• The instance of an OWL ontology resides in a triple store.
The Model
• SPARQL (like SQL, but for triple stores).
SELECT ?c as grandparent
WHERE
( ?a childOf ?b )
( ?b childOf ?c )
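To make the query semantics concrete, the two patterns can be read as a join on the shared variable ?b. The sketch below evaluates that join over an in-memory triple list; the `ex:` names and the function `grandparents` are hypothetical, and a real triple store would use indexes rather than a Python loop.

```python
# Hypothetical family data, for illustration only.
triples = [
    ("ex:alice", "ex:childOf", "ex:bob"),
    ("ex:bob", "ex:childOf", "ex:carol"),
    ("ex:dave", "ex:childOf", "ex:alice"),
]


def grandparents(triples, predicate="ex:childOf"):
    """Return every ?c with (?a childOf ?b) and (?b childOf ?c) for some ?a, ?b."""
    # Index: child -> set of parents.
    parents_of = {}
    for s, p, o in triples:
        if p == predicate:
            parents_of.setdefault(s, set()).add(o)
    result = set()
    for a, parents in parents_of.items():
        for b in parents:                      # binds ?b
            result |= parents_of.get(b, set())  # binds ?c
    return result


print(sorted(grandparents(triples)))
```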
The Model
Rodriguez, M.A., Bollen, J., Van de Sompel, H., “A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage”, IEEE/ACM Joint Conference on Digital Libraries, Vancouver, 2007.
The Model
SELECT ?x
WHERE
( ?x rdf:type mesur:Publishes )
( ?x mesur:hasAuthor lanl:marko )
( ?x mesur:hasAuthor lanl:herbertv )
INSERT < _123 rdf:type mesur:Coauthor >
INSERT < _123 mesur:hasSource lanl:marko >
INSERT < _123 mesur:hasSink lanl:herbertv >
INSERT < _123 mesur:hasWeight COUNT(?x) >
INSERT < _456 rdf:type mesur:Coauthor >
INSERT < _456 mesur:hasSource lanl:herbertv >
INSERT < _456 mesur:hasSink lanl:marko >
INSERT < _456 mesur:hasWeight COUNT(?x) >
From the Publishes contexts, generate a weighted coauthorship network.
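The same derivation can be sketched outside the triple store. The example below assumes each Publishes context has already been reduced to a list of author identifiers; the dictionary `publishes`, the identifier `lanl:johan`, and the function `coauthor_edges` are hypothetical. It emits one directed edge per ordered author pair, weighted by the number of shared Publishes contexts, mirroring the two INSERT groups above.

```python
from collections import Counter
from itertools import permutations

# Hypothetical Publishes contexts: context id -> authors of the published unit.
publishes = {
    "_p1": ["lanl:marko", "lanl:herbertv"],
    "_p2": ["lanl:marko", "lanl:herbertv", "lanl:johan"],
    "_p3": ["lanl:johan"],
}


def coauthor_edges(publishes):
    """Map (source, sink) -> number of Publishes contexts the two authors share."""
    weights = Counter()
    for authors in publishes.values():
        # Ordered pairs give both directions, like the _123 and _456 edges.
        for source, sink in permutations(sorted(set(authors)), 2):
            weights[(source, sink)] += 1
    return weights


for (source, sink), weight in sorted(coauthor_edges(publishes).items()):
    print(source, sink, weight)
```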
The Model
Phase 1 considers only group-level usage and bibliographic data.
The Metrics
• ISI Impact Factor
• Usage Impact Factor
– Bollen, J., Van de Sompel, H., “Usage Impact Factor: The Effects of Sample Characteristics on Usage-based Impact Metrics”, [in review], 2007.
• H-Index
– Hirsch, J.E., “An index to quantify an individual's scientific research output”, Proceedings of the National Academy of Sciences, 102:46, 2005.
• Y-Factor
– Bollen, J., Rodriguez, M.A., Van de Sompel, H., “Journal Status”, Scientometrics, 69:3, 2006.
• …
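As an illustration of one metric from this list, the H-Index is the largest h such that h of an entity's documents have at least h citations each. A minimal sketch, assuming the per-document citation counts have already been extracted from the model (the function name `h_index` is ours):

```python
def h_index(citation_counts):
    """Largest h such that at least h documents have >= h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


print(h_index([10, 8, 5, 4, 3]))
```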
The Metrics
SELECT ?x
WHERE
( ?x rdf:type mesur:Publishes )
( ?x mesur:hasUnit ?a )
( ?x mesur:hasGroup ?b )
( ?b mesur:partOf urn:issn:1082-9873 )
( ?x mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)
( ?y rdf:type mesur:Citation )
( ?y mesur:hasSource ?c )
( ?y mesur:hasSink ?a )
( ?z rdf:type mesur:Publishes )
( ?z mesur:hasUnit ?c )
( ?z mesur:hasTime ?u ) AND ?u = 2007
SELECT ?y
WHERE
( ?y rdf:type mesur:Publishes )
( ?y mesur:hasGroup ?a )
( ?a mesur:partOf urn:issn:1082-9873 )
( ?y mesur:hasTime ?t ) AND (?t > 2004 AND ?t < 2007)
INSERT < _123 rdf:type mesur:ImpactFactor >
INSERT < _123 mesur:hasObject urn:issn:1082-9873 >
INSERT < _123 mesur:hasStartTime 2007 >
INSERT < _123 mesur:hasEndTime 2007 >
INSERT < _123 mesur:hasNumbericValue (COUNT(?x) / COUNT(?y)) >
From the Publishes and Citation contexts, generate Impact Factor Rankings.
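The arithmetic behind the two queries can be sketched as follows: the numerator counts 2007 citations to units the journal published in 2005 and 2006, and the denominator counts those units. The data layout (`publications`, `citations`), the `urn:example:other` journal, and the function `impact_factor` are assumptions made for illustration; only urn:issn:1082-9873 and the year window come from the queries above.

```python
# Hypothetical data: unit -> (journal, publication year).
publications = {
    "a1": ("urn:issn:1082-9873", 2005),
    "a2": ("urn:issn:1082-9873", 2006),
    "a3": ("urn:issn:1082-9873", 2003),
    "b1": ("urn:example:other", 2007),
    "b2": ("urn:example:other", 2007),
    "b3": ("urn:example:other", 2006),
}
# Hypothetical citations: (citing unit, cited unit).
citations = [("b1", "a1"), ("b2", "a1"), ("b1", "a2"), ("b3", "a1"), ("b2", "a3")]


def impact_factor(publications, citations, journal, year):
    """Citations made in `year` to the journal's units from the two prior years,
    divided by the number of those units. Returns None if there are none."""
    window = {year - 2, year - 1}
    citable = {u for u, (j, y) in publications.items() if j == journal and y in window}
    if not citable:
        return None
    cites = sum(
        1
        for source, sink in citations
        if sink in citable
        and source in publications
        and publications[source][1] == year
    )
    return cites / len(citable)


print(impact_factor(publications, citations, "urn:issn:1082-9873", 2007))
```

With this toy data, three of the five citations qualify (the one from 2006 and the one to the 2003 unit are excluded) and two units are citable.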
The Metrics
• Eigenvector-based global-rank metrics such as PageRank, Eigenvector centrality, Y-Factor, and relative-rank ‘spreading activation’ algorithms can be calculated in a similar fashion.
Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks”, [in review], 2007.
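For orientation, a minimal sketch of standard PageRank by power iteration over a single-relation directed network, such as a derived citation or coauthorship network. This is the textbook algorithm, not the grammar-based random walker of the cited paper; the function name, the damping value of 0.85, the fixed iteration count, and the toy edges are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Return node -> PageRank score for a list of directed (source, sink) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, sink in edges:
        out_links[source].append(sink)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Toy network, for illustration only.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "A")]
toy_rank = pagerank(toy_edges)
for node in sorted(toy_rank):
    print(node, round(toy_rank[node], 4))
```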