Indiana University
SPARC/LIBER: June 29th, 2010
School of Informatics and Computing
The MESUR project: an overview and update
Johan Bollen
Indiana University
School of Informatics and Computing
Center for Complex Networks and Systems Research
jbollen@indiana.edu
Acknowledgements:
Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL),
Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL)
Research supported by the NSF and Andrew W. Mellon Foundation.
The scientific process: the importance of early indicators
(Egghe & Rousseau, 2000; Wouters, 1997)
(Brody, Harnad, & Carr, 2006)
Citation: final products
• Publication delays
• Focus on publications
• Focus on authors
Usage data
• Scale, cf. Elsevier downloads (+1B) vs. WoS citations (650M)
• Immediate, early stages
• Variety of resources and actors
The Promise of Usage Data
• Interactions can be recorded for all digital scholarly content, i.e. papers, journals, preprints, blog postings, datasets, chemical structures, software, …
• Not just for ~ 10,000 journals
• Interactions reflect the activities of all users of scholarly information, not only of scholarly authors
• Interactions are recorded starting immediately after publication
• Not only once the work has been read and cited (think publication delays)
• Rapid indicator of scholarly trends
• So the interest in usage data from projects such as COUNTER,
Citebase, IKS and MESUR should not come as a surprise!
Impact metrics: why more is not necessarily better…
So you want to know who’s best?
83M > 50K
Silly?
We do the same in scholarly evaluation!
• Count citations, calculate IFs
• More citations > fewer citations
BUT that is so last century:
• relationships matter more than counts (cf. Google)
• Web 2.0: social network thinking
[Figure: network of related bands, e.g. REM, Teenage Fanclub, Placebo, This Mortal Coil, Wilco; Who, Kinks, Byrds, Beatles]
Crucial distinction: data vs. statistics
• Data: who, what, when, how, …. Keep info on sequence and context.
• Statistics: what, how much. Loss of most sequence and context.
Network-Based Metrics
We have 50 years of network science available to us
• A wide variety of metrics has been proposed to characterize
networks, and to assess the importance of nodes in a network
• E.g. social network analysis, small world graphs, graph theory,
social modeling
• So when defining metrics for scholarly communication (clearly a
network), we should probably leverage network science
• Cf. Google’s PageRank versus Alta Vista’s statistical ranking
• A network (and hence a network-based metric) takes context into
account; a statistical count does not.
• Readings:
• Barabasi (2003) Linked.
• Wasserman & Faust (1994). Social network analysis.
Network-Based Metrics
Classes of metrics:
• Degree: in-degree, out-degree
• Shortest path: closeness, betweenness, Newman
• Random walk: PageRank, eigenvector
• Distribution: in-degree entropy, out-degree entropy, bucket entropy
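To make three of these classes concrete (degree, shortest path, distribution), here is a minimal pure-Python sketch on a toy directed graph. The graph, the node names, and the function names are hypothetical. The entropy used here (Shannon entropy of the weights of a node's incoming edges) is one plausible reading of "in-degree entropy", not necessarily the definition used in MESUR.

```python
import math
from collections import deque

# Hypothetical weighted, directed toy graph: edges[source][target] = weight.
edges = {
    "A": {"B": 3, "C": 1},
    "B": {"C": 2},
    "C": {"A": 1},
    "D": {"C": 4},
}
nodes = sorted(set(edges) | {t for targets in edges.values() for t in targets})

def in_degree(node):
    """Degree class: sum of the weights of incoming edges."""
    return sum(targets.get(node, 0) for targets in edges.values())

def closeness(node):
    """Shortest-path class: reachable nodes divided by total hop distance to them."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search over unweighted hops
        current = queue.popleft()
        for neighbour in edges.get(current, {}):
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

def in_degree_entropy(node):
    """Distribution class: Shannon entropy (bits) of incoming edge weights."""
    weights = [targets[node] for targets in edges.values() if node in targets]
    total = sum(weights)
    if not total:
        return 0.0
    return sum((w / total) * math.log2(total / w) for w in weights)

for n in nodes:
    print(n, in_degree(n), round(closeness(n), 3), round(in_degree_entropy(n), 3))
```

A node fed evenly by many sources gets a high entropy, while a node fed by a single source gets zero, which is the kind of context a plain count cannot express.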
PageRank computed on Citation Network
2003 JCR, Science Edition
5709 journals
Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (DOI:10.1007/s11192-006-0176-z)
Philip Ball. Prestige is factored into journal ratings. Nature 439, 770-771, February 2006 (DOI:10.1038/439770a)
Cf: http://www.eigenfactor.org/
MESUR: A Thorough, Scientific Approach
1. Create very large-scale reference data set: representative sample?
a) Usage, citation and bibliographic data combined
b) Various communities, various collections
2. Investigate validity of usage data and usage-based metrics – focus on journals:
a) Is there any significant structure in usage data?
b) Compute a variety of journal metrics for usage data & cross-validate with other journal metrics, e.g. citation-based IF
3. Deploy tools to explore usage-based journal metrics
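One common way to cross-validate two journal metrics is a rank correlation between them. The sketch below computes Spearman's rank correlation in pure Python. The choice of Spearman, the function names, and the five sample values are assumptions made for illustration and are not MESUR results.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical values for five journals (illustration only).
usage_metric = [120.0, 95.0, 300.0, 40.0, 210.0]
impact_factor = [2.1, 2.5, 6.0, 0.9, 3.3]
rho = spearman(usage_metric, impact_factor)
print(round(rho, 2))
```

A value near 1 would suggest the two metrics rank journals similarly; a low value would suggest they capture different facets of impact.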
Timeline and development
• 2006-2008:
o Andrew W. Mellon Foundation
o Digital Library Research and Prototyping team, Los Alamos National
Laboratory
o Collection of large-scale usage data from some of the world’s most significant publishers, aggregators and institutional consortia
o Feasibility: Usage data, usage-based network models of science,
usage-based impact metrics
• 2009 – infinity and beyond:
o NSF funding (SciSIP, 2009-2012)
o Indiana University, School of Informatics and Computing
o Transfer process underway
o Continuation of MESUR data collection and scientific work
o Focus on studying longitudinal phenomena: innovation, trends,
community-acceptance of novel metrics
Presentation structure
1. MESUR’s Usage reference data set
2. Mapping scientific activity
3. Metrics survey
4. Future research
5. Discussion
Creating the MESUR usage reference data set
2006-2008: Collaborating publishers, aggregators and institutional consortia:
• BMC, Blackwell, UC, CSU (23), EBSCO, ELSEVIER, EMERALD, INGENTA, JSTOR,
LANL, MIMAS/ZETOC, THOMSON, UPENN (9), UTEXAS
• Scale:
o > 1,000,000,000 usage events, and growing…
o > 50M articles, ~100,000 serials
• Period: 2002-2007, but mostly 2006
Data normalization and ingestion
Minimal requirements for all usage data
• Unique usage events (article level)
• Fields: unique session ID, date/time, unique document ID and/or metadata, request type
• Note difference with usage statistics
2007 9 1 0 0 1 CFA cffoe A172080.N1.Vanderbilt.Edu unknown AST A 1996SPIE.2828..64S http://foe.edu/abs/1996SPIE.2828..64S http://www.google.com
2007 9 1 0 0 1 CFA cffoe 210.94.41.89 unknown PHY A 2007ApPhL.90a2120C http://foe.edu/abs/2007ApPhL.90a2120C http://www.google.co.kr
2007 9 1 0 0 1 CFA cffoe 24-196-228-125.dhcp.gwnt.ga.charter.com unknown AST A 2000ASPC.213.333S http://foe.edu/abs/2000bioa.conf.333S http://scholar.google.com
2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A 1993WRR..29.133S http://foe.edu/abs/1993WRR..29.133S http://scholar.google.com
2007 9 1 0 0 6 CFA cffoe pd9e980fc.dip0.t-ipconnect.de 45f0c69881 AST X 2007AN..328.841H http://arXiv.org/abs/0708.1863 http://foe.edu
2007 9 1 0 0 1 CFA cffoe A172080.N1.Vanderbilt.Edu unknown AST A 1996SPIE.2828..64S http://foeabs.edu/abs/1996SPIE.2828..64S http://www.google.com
2007 9 1 0 0 1 CFA cffoe 210.94.41.89 unknown PHY A 2007ApPhL.90a2120C http://foeabs.edu/abs/2007ApPhL.90a2120C http://www.google.co.kr
2007 9 1 0 0 1 CFA cffoe 24-196-228-125.dhcp.gwnt.ga.charter.com unknown AST A 2000ASPC.213.333S http://foeabs.edu/abs/2000bioa.conf.333S http://scholar.google.com
2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A 1993WRR..29.133S http://foeabs.edu/abs/1993WRR..29.133S http://scholar.google.com
2007 9 1 0 0 6 CFA cffoe pd9e980fc.dip0.t-ipconnect.de 45f0c69881 AST X 2007AN..328.841H http://arXiv.org/abs/0708.1863 http://foeabs.edu
2007 9 1 0 0 6 CFA cffoe foel25144.4u.com.gh 47002f8eda PHY A 2002AGUFM.S21A0965M http://foeabs.edu/abs/2002AGUFM.S21A0965M http://www.google.com
2007 9 1 0 0 6 CFA cffoe 66-215-171-214.dhcp.ccmn.ca.charter.com 4681d22a6f AST A 2001P&SS..49.657R http://foeabs.edu/cgi-bin/bib_query?bibcode=2001P%26SS..49.657R http://cfa
2007 9 1 0 0 7 CFA cffoe nat-ptouser3.uspto.gov unknown PHY A 2005ApPhL.86g2106M http://foeabs.edu/abs/2005ApPhL.86g2106M http://www.google.com
2007 9 1 0 0 7 CFA cffoe cpe-71-65-25-115.ma.res.rr.com unknown PHY A 1980SPIE.205.153S http://foeabs.edu/abs/1980SPIE.205.153S http://www.google.com
2007 9 1 0 0 7 CFA cffoe customer3491.pool1.unallocated-106-0.orangehomedsl.co.uk unknown PHY A 1983ElL..19.883V http://foeabs.edu/abs/1983ElL..19.883V http://www.google.co.uk
2007 9 1 0 0 8 CFA cffoe Uranus.seas.ucla.edu 46672d96b2 PHY A 1966Phy..32.385K http://foeabs.edu/abs/1966Phy..32.385K http://www.google.com
2007 9 1 0 0 9 CFA cffoe 75-121-173-37.dyn.centurytel.net 46cf1fd8a6 AST D 1984ApJS..56.257J http://vizier.cfa.edu/viz-bin/VizieR?-source=III/92/ http://foeabs.edu
2007 9 1 0 0 13 CFA cffoe foel17-18.kln.forthnet.gr unknown AST A 1987cosm.book...C http://foeabs.edu/abs/1987cosm.book...C http://www.google.gr
2007 9 1 0 0 15 CFA cffoe hades.astro.uiuc.edu 46f707564d PRE A 2007arXiv0707.3146N http://foeabs.edu/abs/2007arXiv0707.3146N http://foeabs.edu
2007 9 1 0 0 17 CFA cffoe ool-43554752.dyn.optonline.net unknown PHY A 2000PhTea.38.132K http://foeabs.edu/abs/2000PhTea.38.132K http://www.google.com
2007 9 1 0 0 17 CFA cffoe c-68-33-176-222.hsd1.md.comcast.net unknown GEN A 1994RSPSB.256.177M http://foeabs.edu/abs/1994RSPSB.256.177M http://www.google.com
2007 9 1 0 0 19 CFA cffoe 74-36-139-46.dr02.brvl.mn.frontiernet.net unknown AST T 2002SPIE.4767.114W http://foeabs.edu/cgi-bin/nph-abs_connect?bibcode=2002SPIE.4767&db_key=ALL&sort=BIBCODE&a
2007 9 1 0 0 19 CFA cffoe c-76-16-53-120.hsd1.il.comcast.net 46f667b71b AST F 1916PA...24.613L http://articles.foeabs.edu/cgi-bin/nph-iarticle_query?1916PA...24.613L&data_type=PDF_HIGH&w
2007 9 1 0 0 20 CFA cffoe 74-39-37-62.nas03.roch.ny.frontiernet.net unknown PHY E 2007JSTEd.tmp..29B http://dx.doi.org/10.1007/s10972-007-9067-2 http://foeabs.edu
2007 9 1 0 0 22 ANU bio-mirror uatu-virtual1.anu.edu.au 46f9e8f87f AST A 2006ApJ..647.128E http://foe.grangenet.net/abs/2006ApJ..647.128E http://foe.grangenet.net
2007 9 1 0 0 22 CFA cffoe fw.hia.nrc.ca 46f1531d59 AST A 2002P&SS..50.745H http://foeabs.edu/abs/2002P%26SS..50.745H http://foeabs.edu
2007 9 1 0 0 22 CFA cffoe 24-117-0-220.cpe.cableone.net unknown AST A 1984BITA..15.268S http://foeabs.edu/abs/1984BITA..15.268S http://www.google.com
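The sample log lines above can be reduced to the minimal fields listed on the slide (session ID, date/time, document ID, request type). The following is a minimal sketch: the column layout (six date/time columns, host in column 9, session cookie in column 10, request type in column 12, document ID in column 13) is inferred from the sample and is an assumption, as are the `UsageEvent` and `parse_event` names and the fallback to the host name when the cookie is "unknown".

```python
from datetime import datetime
from typing import NamedTuple

class UsageEvent(NamedTuple):
    session_id: str
    timestamp: datetime
    document_id: str
    request_type: str

def parse_event(line: str) -> UsageEvent:
    """Parse one whitespace-separated log line under the assumed column layout."""
    fields = line.split()
    year, month, day, hour, minute, second = (int(f) for f in fields[:6])
    host, cookie = fields[8], fields[9]
    # Assumption: fall back to the host name when no session cookie was recorded.
    session_id = cookie if cookie != "unknown" else host
    return UsageEvent(
        session_id=session_id,
        timestamp=datetime(year, month, day, hour, minute, second),
        document_id=fields[12],
        request_type=fields[11],
    )

sample = ("2007 9 1 0 0 4 CFA cffoe 163.152.35.114 4700387eae PHY A "
          "1993WRR..29.133S http://foe.edu/abs/1993WRR..29.133S "
          "http://scholar.google.com")
event = parse_event(sample)
print(event)
```

Keeping one record per event, rather than a count per document, is what preserves the sequence and session context that usage statistics discard.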
Data set: subset of MESUR
• Common time period:
o March 1st 2006 - February 1st 2007
o Thomson Scientific (Web of Science), Elsevier (Scopus), JSTOR, Ingenta, University of Texas (9 campuses, 6 health institutions), and California State University (23 campuses)
• 346,312,045 usage events
• 97,532 serials (many of which are not journals)
How to generate a usage network.
Same session ~ document relatedness
• Same session, same user: common interest
• Frequency of co-occurrence = degree of relationship
• Normalized: conditional probability
Usage data is on article level:
• Works for journals and articles
• Anything for which usage was recorded
Note: not something we invented: association rule learning in data mining.
Beer and diapers!
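The steps on this slide can be sketched as follows: count how often two items occur in the same session, then normalize the count into a conditional probability. The session data and variable names are hypothetical, and treating every pair within a session as co-occurring (rather than only consecutive requests) is a simplifying assumption.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sessions: session ID -> journals (or articles) requested in it.
sessions = {
    "s1": ["J1", "J2", "J3"],
    "s2": ["J1", "J2"],
    "s3": ["J2", "J3"],
    "s4": ["J1"],
}

occurrences = defaultdict(int)      # number of sessions containing each item
co_occurrences = defaultdict(int)   # number of sessions containing both items

for items in sessions.values():
    unique_items = sorted(set(items))
    for item in unique_items:
        occurrences[item] += 1
    for a, b in combinations(unique_items, 2):
        co_occurrences[(a, b)] += 1
        co_occurrences[(b, a)] += 1

# Directed edge weight a -> b: P(b in session | a in session).
network = {
    (a, b): count / occurrences[a]
    for (a, b), count in co_occurrences.items()
}

for edge, weight in sorted(network.items()):
    print(edge, round(weight, 2))
```

Because the normalization divides by the occurrences of the source item, the resulting network is directed: J3 always appears with J2 in this toy data, but J2 appears with J3 in only two of its three sessions.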
Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-resolution maps of science. PLoS ONE, February 2009.
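Network science for impact metrics.
Betweenness centrality
$$C_B(v_k) = \sum_{i \neq j \neq k} \frac{\sigma_{ij}(v_k)}{\sigma_{ij}}$$
$\sigma_{ij}$: number of geodesics between $v_i$ and $v_j$
$\sigma_{ij}(v_k)$: number of those geodesics that pass through $v_k$
PageRank
$$PR(v_i) = \frac{1 - \lambda}{N} + \lambda \sum_{v_j \to v_i} \frac{PR(v_j)}{O(v_j)}$$
$PR(v_i)$: PageRank of node $v_i$
$O(v_j)$: out-degree of journal $v_j$
$N$: number of nodes in network
$\lambda$: dampening factor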
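The PageRank quantities listed above (PR(vi), O(vj), N, and the dampening factor) fit the standard recurrence, which can be solved by power iteration. The sketch below is a minimal unweighted version: the `pagerank` function name, the toy graph, the damping value of 0.85, and the handling of nodes without out-links are assumptions, not the configuration used in the MESUR analyses.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power iteration for PR(v_i) = (1 - d)/N + d * sum_j PR(v_j) / O(v_j)."""
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for source in nodes:
            targets = out_links.get(source, [])
            if targets:
                # Each out-link carries an equal share of the source's rank.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a node without out-links spreads its rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[source] / n
        rank = new_rank
    return rank

# Hypothetical unweighted journal graph: source -> list of targets.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
journal_ranks = pagerank(graph)
for journal, value in sorted(journal_ranks.items()):
    print(journal, round(value, 3))
```

In this toy graph C ranks highest because it receives links from three nodes, while D, which nothing links to, keeps only the baseline (1 - d)/N.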
A variety of impact metrics
Note:
• Metrics can be calculated both on citation and usage data
• “Frequentist”
o Citation and usage rates
• “Structural”
o Citation graph, e.g. 2005 JCR
o Usage graph, e.g. created by MESUR
• H-index, G-index, SJR, etc.
What do they MEAN?
What facets of impact do they represent?
Which are best suited?
Set of metrics calculated on MESUR data set
The MESUR Metrics Map
RATE METRICS
TOTAL CITES
PAGERANK(S)
BETWEENNESS
USAGE METRICS
Johan Bollen, Herbert Van de Sompel, Aric Hagberg and Ryan Chute. A Principal Component Analysis of 39 Scientific Impact Measures. PLoS ONE, June 2009. URL: http://dx.plos.org/10.1371/journal.pone.0006022.
Samples of future work (can be skipped)
• Longitudinal studies:
o Network changes over time: collaboration with Carl Bergstrom (UW)
o Prediction of innovation using random walk models
• Logistics:
o Expand existing data set: focus on standardization, repeatability
o Establish continued funding, good home for project
o “Center” model: rather than data->scientists, scientists->data
Animated maps: tracing bursts of scientific activity
Coordinated bursts
MESUR Mapping and ranking services
MESUR: the good ...
After 3 years of MESUR:
• Scientific exploration of metrics for scholarly evaluation
• Creation of large-scale reference data set
• Mapping science from the viewpoint of users: there is structure!
• Variety of metrics that cover various aspects of scholarly impact and prestige
• MESUR dataset contains many more pearls for future research
• Foundation for future continued research program:
• Longitudinal studies
• Models of collective behavior of scientists
MESUR: the bad and the ugly …
Scalability of the approach:
• Lengthy negotiations to obtain log data
• No infrastructure standards (yet): Recording, aggregating, normalization, ingestion, de-duplication,…
• No generally accepted policies: privacy, property, …
• No census data: when is a sample large and representative enough?
Quality control:
• Bots, Crawlers (detectable but never perfect)
• Cheating, manipulation (easier with usage statistics than network metrics)
Acceptance:
• Network-based usage metrics require session information. This is overlooked! As a result, will we end up with usage-based statistics only?
• “As simple as possible, but not more simple!”
Increasing community involvement
“Registration is now open for "Scholarly Evaluation Metrics: Opportunities and Challenges", a one-day NSF-funded workshop that will take place in the Renaissance Washington DC Hotel on Wednesday, December 16th 2009. Participation in this workshop is limited to 50 people. Registration is free at http://informatics.indiana.edu/scholmet09/registration.html.
The topic of the workshop is the future of scholarly assessment approaches, including organizational, infrastructural, and community issues. The overall goal is to identify requirements for novel assessment approaches, several of which have been proposed in recent years, to become acceptable to community stakeholders including scholars, academic and research institutions, and funding agencies. The impressive group of speakers and panelists for the workshop includes representatives from each of these constituencies.
Further details are available at http://informatics.indiana.edu/scholmet09/announcement.html
Workshop organizers:
Johan Bollen (jbollen@indiana.edu),
Herbert Van de Sompel (hvdsomp@gmail.com) and
Ying Ding (dingying@indiana.edu)”
Moving towards community involvement
Planning process underway to establish sustainable, open, community-supported infrastructure.
New support from the Andrew W. Mellon Foundation to figure it all out.
Logistics: data aggregation, normalization, data-related services, data management
Science: metrics, analysis, prediction
Services: ranking, assessment, mapping
= More than sum of parts:
• Each component supports the other
• Various business and funding models
• Generate added value on all levels
Can fundamentally change scholarly communication
Some relevant publications.
Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-resolution maps of science. PLoS One, March 2009 (In Press)
Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Ryan Chute. A principal component analysis of 39 scientific impact measures. arXiv.org/abs/0902.2183
Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030)
Johan Bollen, Herbert Van de Sompel, and Marko A. Rodriguez. Towards usage-based impact metrics: first results from the MESUR project. In Proceedings of the Joint Conference on Digital Libraries, Pittsburgh, June 2008
Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007
Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154)
Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.
Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.
Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.