Comparing the Scientific Impact of Conference and Journal Publications in Computer Science
Erhard Rahm
http://dbs.uni-leipzig.de
Academic Publishing in Europe (APE) Conf., 2008, Berlin
Jan. 23, 2008
Citation analysis
Citation analysis is increasingly used to measure the scientific impact of
  Journals (impact factor)
  Authors
  Institutions
JCR impact factors are limited to journals
  Much computer science research is published only in conferences
  Need to consider citations from / to (refereed) conference publications
Citation analysis is a huge data integration problem
  Need to automate as much as possible with good data quality
MS Libra statistics (Dec. 2007)
http://libra.msra.cn

Venue type                    #venues  #papers (all)  #cited (all)  #papers (top 100 venues)  #cited (top 100 venues)
Journals                          471        321,000     1,655,000                   190,000                1,434,000
Conference / workshop series    2,297        585,000     1,752,000                   167,000                1,216,000
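The table invites a per-paper comparison. The following minimal sketch derives average citations per paper from the figures above. It assumes that "#cited" is the total number of citations received, which the slide does not state explicitly; the dictionary keys are illustrative names.

```python
# Figures copied from the MS Libra table (Dec. 2007).
# Assumption: "#cited" is the total number of citations received.
libra = {
    "journals": {
        "papers_all": 321_000, "cited_all": 1_655_000,
        "papers_top100": 190_000, "cited_top100": 1_434_000,
    },
    "conference / workshop series": {
        "papers_all": 585_000, "cited_all": 1_752_000,
        "papers_top100": 167_000, "cited_top100": 1_216_000,
    },
}


def citations_per_paper(row):
    """Average citations per paper, for all venues and for the top 100 venues."""
    return {
        "all": row["cited_all"] / row["papers_all"],
        "top100": row["cited_top100"] / row["papers_top100"],
    }


ratios = {venue: citations_per_paper(row) for venue, row in libra.items()}
for venue, r in ratios.items():
    print(f"{venue}: all={r['all']:.2f}, top100={r['top100']:.2f}")
```

Under that assumption, journals average about 5.2 citations per paper overall against about 3.0 for conference and workshop series, while the top 100 venues of each kind are nearly level (about 7.5 vs. 7.3).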
Agenda
Motivation
In-depth comparison for CS publications on databases
  Data sources
  Conference vs. journal impact factors
  Citation skew, rankings (nation, institution)
Data integration of bibliographic web data
  MOMA framework for record matching
  Online citation service (OCS)
Summary
Citation analysis of database publications*
10 years: 1994 – 2003
5 venues:
  2 conference series (ACM SIGMOD, VLDB), 3 journals (ACM TODS, VLDB Journal, SIGMOD Record)
Evaluation using 2005 and 2007 citation data

  good coverage of CS venues
  manually curated, good quality
  no citation counts

  many citations
  very good coverage of computer science research
  data quality problems (duplicates, …) due to automatic information extraction

* Rahm, E., A. Thor: Citation analysis of database publications, ACM SIGMOD Record, Dec. 2005
Further Citation Sources
ACM Digital Library
#citings per source (to papers of considered venues and years), as of Dec. 2007
[Figure: #citings per publication year, 1994 to 2003; y-axis 0 to 18000; series: Google Scholar, MS Libra, Scopus*, ACM DL, Citeseer, Thomson ISI**]
* Scopus does not cover VLDB conf.
** ISI does not cover conferences; VLDBJ / SR since 1998/2000
Conferences vs. Journals
[Figure: two bar charts over SIGMOD Conf., VLDB Conf., VLDB Journal, ACM TODS, SIGMOD Record]
  # Publications (1994-2003); y-axis 0 to 600
  # Citings per venue (1994-2003); y-axis 0 to 50000; series: GS 2005, GS 2007
Conf. vs. Journals: #citings per paper
[Figure: bar chart of #citings per paper for SIGMOD Conf., VLDB Conf., VLDB Journal, ACM TODS, SIGMOD Record; y-axis 0 to 120; series: GS 2005, GS 2007]
JCR impact factors for journals
[Figure: JCR impact factor per year, 1996 to 2004; y-axis 0 to 14; series: VLDB Journal, ACM TODS, SIGMOD Record]
Journal impact factor IF(X) = average #citings in year X for a journal article published in the 2 preceding years X-1 and X-2
IF can also be determined for annual conference series
Can be generalized to articles from k preceding years (e.g. k=5)
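The impact factor definition above can be sketched in a few lines. This is a minimal illustration, not the JCR procedure: the function name, the tuple layout and the toy data are assumptions made for the example.

```python
def impact_factor(papers, citings, year, k=2):
    """IF(year): citings made in `year` to papers published in the k
    preceding years, divided by the number of those papers.

    papers:  iterable of (paper_id, publication_year)
    citings: iterable of (cited_paper_id, citing_year)
    """
    window = range(year - k, year)  # years X-k .. X-1
    in_window = {pid for pid, pub_year in papers if pub_year in window}
    if not in_window:
        return 0.0
    hits = sum(1 for pid, citing_year in citings
               if citing_year == year and pid in in_window)
    return hits / len(in_window)


# Hypothetical toy data for one venue (not taken from the study).
papers = [("p1", 2001), ("p2", 2002), ("p3", 2002), ("p4", 1999)]
citings = [("p1", 2003), ("p1", 2003), ("p2", 2003),
           ("p3", 2002), ("p4", 2001)]

print(impact_factor(papers, citings, 2003))        # 2-year window
print(impact_factor(papers, citings, 2003, k=5))   # 5-year window
```

The same function applies unchanged to an annual conference series, since it only needs publication years and citing years.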
GS-based impact factors
[Figure: two line charts, GS 2007 and GS 2005; impact factor per year, 1996 to 2004; y-axis 0 to 14; series: SIGMOD Conference, VLDB Conf., VLDB J., ACM TODS, SIGMOD Record]
Consider only citing GS publications with year (ca. 77%)
SIGMOD conf. > VLDB conf. > Journals
2007 data: higher impact factors than 2005 and than using JCR
GS-based impact factors (5 years)
[Figure: two line charts, GS 2007 and GS 2005; 5-year impact factor per year, 1999 to 2004; y-axis 0 to 14; series: SIGMOD Conference, VLDB Conf., VLDB J., ACM TODS, SIGMOD Record]
Impact factors more stable for 5 years
Conferences maintain higher impact than journals
Citation skew
Citation distribution (split by quarters)
  Top 25% most-referenced publications → 60-80% of citings
  SR has highest skew, TODS is most balanced
[Figure: stacked bars, 0% to 100%, showing the share of citings per quarter of publications (25%, 50%, 75%, 100%) and the Gini coefficient for SIGMOD, VLDB, SIGMOD Record, VLDB Journal, TODS]
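The two skew measures named on this slide, the share of citings held by the top quarter of papers and the Gini coefficient, can be computed as sketched below. The slide does not say how the authors computed them, so the standard sorted-rank Gini formula is an assumption, and the sample counts are made up for illustration.

```python
def top_quarter_share(citings):
    """Fraction of all citings received by the 25% most-cited papers."""
    ranked = sorted(citings, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    top_n = max(1, len(ranked) // 4)
    return sum(ranked[:top_n]) / total


def gini(citings):
    """Gini coefficient: 0 = all papers equally cited, near 1 = highly skewed."""
    xs = sorted(citings)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-paper citing counts for one venue.
sample = [100, 20, 10, 5, 3, 2, 0, 0]
print(f"top quarter share: {top_quarter_share(sample):.2f}")
print(f"Gini: {gini(sample):.2f}")
```

A venue whose papers are cited evenly yields a top-quarter share of 0.25 and a Gini coefficient of 0, while higher values of either measure indicate stronger skew.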
Aggregated Citation Frequencies
based on institution of first author
only papers with at least 20 citings (w/o self-citings) are considered
Matching objects in web sources
@article{DBLP:journals/vldb/RahmB01,
  author = {Erhard Rahm and Philip A. Bernstein},
  title = {A survey of approaches to automatic schema matching.}
  journal= {VLDB J.}, year = {2001}, ...
[Figure labels: DBLP, Google Scholar, Information Fusion, ACM]
Object matching framework MOMA
MOMA = Mapping based Object Matching*
Object consolidation framework
  Matching objects from 2 sources
  Generation of instance mappings (correspondences)
  Special case: duplicate detection within 1 source (generation of self-mapping)
Key features
  Extensible matcher library
  Mapping combination
  Construction of match workflows
  Storage of mappings for reuse in other match problems
Implemented within iFuice data integration platform

Same-mapping for authors:
Source A  Source A'  Sim
a1        a'1        1
a2        a'1        0.9
a3        a'3        0.8

* Thor, Rahm: MOMA - A Mapping-based Object Matching System. Proc. CIDR, 2007
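To make the idea of mapping combination concrete, here is a minimal sketch that represents a mapping as a dictionary from object pairs to similarity values, merges the output of two matchers, and selects the correspondences above a threshold. It is not the MOMA API: the function names `merge` and `select`, the averaging rule, the threshold and the second matcher's values are all assumptions for illustration. Only the first mapping mirrors the same-mapping example on the slide.

```python
def merge(m1, m2, combine=lambda s1, s2: (s1 + s2) / 2):
    """Combine two mappings over the union of their pairs.
    A pair missing from one mapping counts as similarity 0 there."""
    pairs = set(m1) | set(m2)
    return {p: combine(m1.get(p, 0.0), m2.get(p, 0.0)) for p in pairs}


def select(mapping, threshold):
    """Keep only correspondences whose similarity reaches the threshold."""
    return {p: s for p, s in mapping.items() if s >= threshold}


# First matcher: values from the same-mapping example on the slide.
matcher_1 = {("a1", "a'1"): 1.0, ("a2", "a'1"): 0.9, ("a3", "a'3"): 0.8}
# Second matcher: hypothetical values, e.g. from comparing another attribute.
matcher_2 = {("a1", "a'1"): 1.0, ("a2", "a'1"): 0.3, ("a3", "a'3"): 0.8}

combined = merge(matcher_1, matcher_2)
same_mapping = select(combined, threshold=0.7)
print(sorted(same_mapping.items()))
```

Passing `combine=min` or `combine=max` instead of the average gives a stricter or more permissive combination. Which rule fits depends on how much each matcher is trusted.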
MOMA Architecture
[Diagram: sources LDS A and LDS B feed a match workflow of Matcher 1, Matcher 2, ..., Matcher n, followed by a Mapping Combiner (mapping operator, selection) that produces a same-mapping. Supporting components: Matcher Library (matcher implementations, e.g., attribute-based), Match Workflows, Mapping Cache, Mapping Repository]
On-demand citation analysis
On-demand citation service (OCS)*
  What are the most cited papers of conference X?
  What is the average citation number of publications from author Y?
  Frequent changes, i.e., new publications & new citations
Idea: Combine publication lists, e.g. from DBLP or Pubmed, with citation counts, e.g. from GS, Citeseer or Scopus
  DBLP, Pubmed: high bibliographic data quality
  GS: large coverage of citation counts
Query problem: Given a set of DBLP publications → How to find the corresponding GS publications?
  Query GS and match DBLP-GS

* Thor, Aumueller, Rahm: Data Integration Support for Mashups. Proc. IIWeb, 2007
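The "query GS and match DBLP-GS" step can be illustrated with a simple title-based matcher. This is a sketch under stated assumptions, not the OCS implementation: it uses Python's standard `difflib.SequenceMatcher` as the similarity measure, a hypothetical threshold of 0.9, and invented citation counts. Summing over all matching entries reflects that GS may list the same publication several times.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and drop punctuation so minor formatting differences vanish."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()


def title_sim(a, b):
    """Similarity of two titles in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_citations(dblp_titles, gs_entries, threshold=0.9):
    """For each DBLP title, sum the citation counts of all GS entries
    whose title is similar enough (duplicates in GS are added up)."""
    return {
        title: sum(cites for gs_title, cites in gs_entries
                   if title_sim(title, gs_title) >= threshold)
        for title in dblp_titles
    }


dblp = ["A survey of approaches to automatic schema matching."]
# Hypothetical GS result list; the citation counts are invented.
gs = [
    ("A survey of approaches to automatic schema matching", 120),
    ("Survey of approaches to automatic schema matching", 15),
    ("An unrelated paper on query optimization", 80),
]

print(match_citations(dblp, gs))
```

A matcher that uses only titles can confuse different papers with near-identical titles, so comparing authors or year as well would make the match safer.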
Online Citation Service: Result overview
[Screenshot with annotations:]
  Bibliographic data from DBLP
  Sum of GS citations
  Corresponding GS publications
OCS example: Top conference papers
OCS example: Top journal papers
Summary
Large scientific impact of conference publications in computer science
  Must be considered for a meaningful citation analysis
  In some fields, e.g. database research, top conferences receive many more citings than top journals
  Impact factors should be extended to major conferences
#citings are highly skewed within venues -> need for individual (per author/organization etc.) impact analysis
  not just #publications and general venue impact
Need for improved data integration on heterogeneous data sources (more automatic, high data quality)
  U Leipzig: new research prototypes for data integration, object matching and on-demand citation analysis