Page 1

Metadata characteristics as predictors for editor selectivity in a current awareness service

Thomas Krichel & Nisa Bakkalbasi
2005-10-31

Page 2

outline

• Background to work that we did
  – RePEc (Research Papers in Economics)
  – NEP: New Economics Papers

• The research
  – Theory
  – Method
  – Results

• Other work done for NEP.

Page 3

RePEc

• Digital library for academic Economics. It collects descriptions of
  – economics documents (working papers, articles etc.)
  – collections of those documents
  – economists
  – collections of economists

• Pioneering effort to create a relational dataset describing an academic discipline as a whole.

• The data is freely available.

Page 4

RePEc principle

• Many archives
  – Archives offer metadata about digital objects, or data about authors and institutions.

• One database

• Many services
  – Users can access the data through many interfaces.
  – Providers of archives offer their data to all interfaces at the same time. This provides for an optimal distribution.

Page 5

it's the incentives, stupid

• RePEc applies the ideas of open source to the construction of a bibliographic dataset. It provides an open library.

• The entire system is constructed in such a way as to be sustainable without monetary exchange between participants.

Page 6

some history

• In the early 1990s, Thomas Krichel dreamed about a current awareness service for working papers. It would later have electronic papers.

• In 1993 he made the first economics working paper available online.

• In 1997 he wrote the key protocols that “govern” RePEc.

Page 7

RePEc is based on 500+ archives

• WoPEc
• EconWPA
• DEGREE
• S-WoPEc
• NBER
• CEPR
• Elsevier
• US Fed in Print
• IMF
• OECD
• MIT
• University of Surrey
• CO PAH
• Blackwell

Page 8

to form a 340+k item dataset

161,000 working papers

180,000 journal articles

1,300 software components

1,200 book and chapter listings

8,000 author contact & publication listings

9,100 institutional contact listings

more records than arXiv.org

Page 9

RePEc is used in many services

• EconPapers
• NEP: New Economics Papers
• Inomics
• RePEc author service
• Z39.50 service by the DEGREE partners
• IDEAS
• RuPEc
• EDIRC
• LogEc
• CitEc

Page 10

NEP: New Economics Papers

• This is a set of current awareness reports on new additions to the working paper stock only. Journal articles would be too old.

• Founded by Thomas Krichel in 1998.

• Supported by the Economics department at WUStL.

• Initial software was written by Jose Manuel Barrueco Cruz.

• First general editor was John S. Irons.

Page 11

why NEP

• Public aim: Current awareness, if done well, can be an important service in its own right. It is sheltered from the competition of general search engines.

• Private aim: It is useful to have some classification information, even if it is limited. This should be useful in performance measures within subject areas.

Page 12

modus operandi: stage 1

• The general editor uses a computer program that gathers all the new additions to the working paper stock. This is usually done weekly.

• S/he filters out new descriptions of old papers, using
  – the date field
  – handle heuristics

• The result is an issue of the nep-all report.

Page 13

modus operandi: stage 2

• Editors consider the papers in the nep-all report to filter out papers that belong to the subject. This forms an issue of a subject report nep-???.

• nep-all and the subject reports are circulated via email.

• A special arrangement makes the data of NEP available to other RePEc services.

Page 14

some numbers

• There are now 60+ NEP lists.

• Over 37k subscriptions.

• Close to 16k subscribers.

• Over 50k papers announced.

• Over 100k announcements.

• Homepage at http://nep.repec.org

All this is a fantastic success!!

Page 15

problem with the private aim

• We would need all papers to be classified, not only the working papers.

• We would need to have 100% coverage of NEP.

• This means every paper in nep-all appears in at least one subject report.

Page 16

coverage ratio

• We call the coverage ratio the share of papers in nep-all that have been announced in at least one subject report.

• We can define this ratio
  – for each nep-all issue
  – for a subset of nep-all issues
  – for NEP as a whole
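
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name, the paper handles and the data layout (each nep-all issue as a list of handles, plus a set of handles announced in any subject report) are assumptions, not part of NEP's software.

```python
def coverage_ratio(nep_all_issues, announced):
    """Share of nep-all papers announced in at least one subject report.

    nep_all_issues: list of issues, each a list of paper handles (assumed layout).
    announced: set of handles that appeared in at least one subject report.
    """
    total = 0
    covered = 0
    for issue in nep_all_issues:
        total += len(issue)
        covered += sum(1 for handle in issue if handle in announced)
    return covered / total if total else 0.0


# Hypothetical handles, for illustration only.
issue_1 = ["a", "b", "c", "d"]
issue_2 = ["e", "f"]
announced = {"a", "b", "c", "e"}

per_issue = [coverage_ratio([issue], announced) for issue in (issue_1, issue_2)]
overall = coverage_ratio([issue_1, issue_2], announced)
```

Passing a single issue, a subset of issues, or all issues gives the three variants of the ratio listed above.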

Page 17

coverage ratio theory & evidence

• Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase.

• However, the evidence from research by Barrueco Cruz, Krichel and Trinidad is that
  – the coverage ratio of different nep-all issues varies a great deal
  – overall, it remains at around 70%.

• We need some theory as to why.

Page 18

two theories

• Target-size theory

• Quality theory
  – descriptive quality
  – substantive quality

Page 19

theory 1: target size theory

• When editors compose a report issue, they have a size of the issue in mind.

• If the nep-all issue is large, editors will take a narrow interpretation of the report subject.

• If the nep-all issue is small, editors will take a wide interpretation of the report subject.

Page 20

target size theory & static coverage

• There are two things going on.
  – The opening of new subject reports improves the coverage ratio.
  – The expansion of RePEc implies that the size of nep-all, though varying in the short run, grows in the long run. Target size theory implies that the coverage ratio deteriorates.

• The static coverage ratio that we observe is the result of both effects canceling out.

Page 21

theory 2: quality theory

• George W. Bush version of quality theory
  – Some papers are rubbish. They will not get announced.
  – The amount of rubbish in RePEc remains constant.
  – This implies constant coverage.

• Reality is slightly more subtle.

Page 22

two versions of quality theory

• Descriptive quality theory: papers that are badly described
  – misleading titles
  – no abstract
  – languages other than English

• Substantive quality theory: papers that are well described, but not good
  – from unknown authors
  – issued by institutions with unenviable research reputation

Page 23

practical importance

• We do care whether one or the other theory is true.
  – Target size theory implies that NEP should open more reports to achieve perfect coverage.
  – Quality theory suggests that opening more reports will have little to no impact on coverage.

• Since operating more reports is costly, there should be an optimal number of reports.

Page 24

overall model

• We need an overall model that explains subject editors' behavior.

• We can feed this model with variables that represent theoretical determinants of behavior.

• We can then assess the strength of various factors empirically.

Page 25

method

• The dependent variable is announced. It is 1 if the paper has been announced, 0 otherwise.

• Since we are explaining a binary variable, we can use binary logistic regression analysis (BLRA). This is a fairly flexible technique, useful when the probability distributions governing the independent variables are not well known.

• That's why BLRA is popular in the life sciences.

Page 26

independent variables: size

• size is the size of the nep-all issue in which the paper appeared.

• This is the critical indicator of target size theory. We expect it to have a negative impact on announced.

Page 27

independent variables: position

• position is the position of the paper in the nep-all issue.

• The presence of this variable can be justified by the combined assumption of target size and editor myopia.

• If editors are myopic, they will be more liberal at the start of nep-all than at the end of nep-all.

Page 28

independent variables: title

• title is the length of the title of the paper, measured by the number of characters.

• This variable is motivated by descriptive quality theory. A longer title will say more about the paper than a short title. This makes it less likely that a paper is overlooked.

Page 29

independent variables: abstract

• abstract is the presence/absence of an abstract to the paper.

• This is also motivated by descriptive quality theory.

• Note that we do not use the length of the abstract because that would be a highly skewed variable.

Page 30

independent variables: language

• language indicates whether the metadata is in English or not.

• This variable is motivated by descriptive quality theory and the idea that English is the most commonly understood language.

• While there are a lot of multilingual editors, customizing this variable for each editor would have been rather hard.

Page 31

independent variables: series

• series is the size of the series in which a paper appears.

• This variable is motivated by substantive quality theory.

• The larger a series is, the higher its reputation usually is. We can roughly rank series by size and quality
  – multi-institution series (NBER, CEPR)
  – large departments
  – small departments

Page 32

independent variables: author

• author is the prolificacy of the authors of the paper.

• It is justified by substantive quality theory.

• This is the most difficult variable to measure. We use the number of papers written by the most prolific registered author of the paper.

• Since about 50% of the papers have no registered author, a lot of them are excluded. But the exclusion should not introduce a bias.
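
As a rough sketch of how the author variable described above could be computed: take the paper counts of a paper's registered authors and keep the largest one, dropping papers without any registered author. The function name and the input layout are assumptions made for illustration.

```python
def author_variable(registered_author_paper_counts):
    """Paper count of the most prolific registered author of one paper.

    registered_author_paper_counts: list with one paper count per registered
    author of the paper (assumed layout). Returns None when the paper has no
    registered author, in which case the paper is excluded from the sample.
    """
    if not registered_author_paper_counts:
        return None
    return max(registered_author_paper_counts)


# Hypothetical papers: one with three registered authors, one with none.
with_authors = author_variable([3, 12, 7])
without_authors = author_variable([])
```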

Page 33

create categorical variables

• size_1 [179, 326)

• size_2 [326, 835]

• title_1 [55, 77)

• title_2 [77, 1945]

• position_1 [0.357, 0.704)

• position_2 [0.704, 1.000]

• series_1 [98, 231)

• series_2 [231, 3654]
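
The intervals above can be turned into 0/1 indicator variables as in the following sketch. The interval bounds are taken from the slide. The assumption, labeled in the code, is that a value below the lower bound of the `_1` interval belongs to a reference category in which both indicators are 0. The names `BINS` and `dummies` are illustrative.

```python
# (lower bound of _1, cut point between _1 and _2, upper bound of _2),
# copied from the slide.
BINS = {
    "size": (179, 326, 835),
    "title": (55, 77, 1945),
    "position": (0.357, 0.704, 1.000),
    "series": (98, 231, 3654),
}


def dummies(name, value):
    """Return the two indicator variables for one continuous variable.

    ASSUMPTION: values below the lower bound fall in the reference
    category, so both indicators are 0 for them.
    """
    lower, cut, upper = BINS[name]
    return {
        f"{name}_1": int(lower <= value < cut),   # half-open interval [lower, cut)
        f"{name}_2": int(cut <= value <= upper),  # closed interval [cut, upper]
    }
```

For example, `dummies("size", 200)` gives `{"size_1": 1, "size_2": 0}`, and a title of 30 characters gives 0 for both title indicators.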

Page 34

results

• P(announced = 1 | x) = exp(g(x)) / (1 + exp(g(x)))

• g(x) = 0.2401 - 0.2774*size_1 - 0.4657*size_2 + 0.1512*title_1 + 0.2469*title_2 + 0.3874*abstract + 0.0001*author + 0.7667*language - 0.1159*series_1 + 0.1958*series_2

• position is not significant. author just makes the cut.
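
To make the fitted model concrete, the sketch below evaluates g(x) and the announcement probability for a given paper, using the coefficients exactly as printed on the slide. The dictionary layout and function names are assumptions, and the example paper is hypothetical.

```python
import math

INTERCEPT = 0.2401
COEFFICIENTS = {
    "size_1": -0.2774,
    "size_2": -0.4657,
    "title_1": 0.1512,
    "title_2": 0.2469,
    "abstract": 0.3874,
    "author": 0.0001,
    "language": 0.7667,
    "series_1": -0.1159,
    "series_2": 0.1958,
}


def g(x):
    """Linear predictor; variables missing from x are treated as 0."""
    return INTERCEPT + sum(coef * x.get(name, 0) for name, coef in COEFFICIENTS.items())


def p_announced(x):
    """P(announced = 1 | x) = exp(g(x)) / (1 + exp(g(x)))."""
    e = math.exp(g(x))
    return e / (1.0 + e)


# Hypothetical paper: large nep-all issue, long title, abstract present,
# English metadata, large series, most prolific registered author has 10 papers.
paper = {"size_2": 1, "title_2": 1, "abstract": 1, "language": 1,
         "series_2": 1, "author": 10}
probability = p_announced(paper)
baseline = p_announced({})  # all explanatory variables at 0
```

With these inputs the hypothetical paper has g(x) = 1.3722, which corresponds to a probability of roughly 0.80, against roughly 0.56 for the all-zero baseline.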

Page 35

odds ratio

• size_1 1.32 [1.22, 1.44]

• size_2 0.83 [0.76, 0.90]

• title_1 1.16 [1.07, 1.26]

• title_2 1.28 [1.18, 1.39]

• abstract 1.47 [1.34, 1.62]

• language 2.15 [1.85, 2.51]

• series_1 1.11 [1.02, 1.20]

• series_2 1.37 [1.26, 1.49]

• author 1.05 [1.01, 1.09]

Page 36

scandal!

• Substantive quality theory cannot be rejected. That means that the editors are selecting for quality as well as for the subject.

• The editors have rejected our findings. Almost all protest that there is no quality filtering.

Page 37

consequences

• There has been no program to expand the list.

• There has to be a concerted effort to help editors find subject-specific papers.

• More effort needs to be made for editors to really find the subject-specific papers. This can be done by
  – the use of a more efficient interface
  – the use of automated resource discovery methods.

Page 38

ernad

• ernad stands for editing reports on new academic documents. It is a purpose-built software system for current awareness reports.

• It has been designed by Thomas Krichel, http://openlib.org/home/krichel/work/altai.html

• The system was written by Roman D. Shapiro.

Page 39

statistical learning

• The idea is that a computer may be able to make decisions on the current nep-all reports based on the observation of earlier editorial decisions.

• ernad now works using support vector machines (SVM), with titles, abstracts, author name, classification values and series as features.
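
The idea of learning from earlier editorial decisions can be illustrated with a very small linear SVM. This is not ernad's implementation: it is a sketch that uses only title words as features (the slide also lists abstracts, author names, classification values and series), trains by plain subgradient descent on the regularized hinge loss, and runs on made-up titles and decisions.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words term counts for one title (simplified feature set)."""
    return Counter(text.lower().split())


def score(weights, bias, feats):
    """Signed distance-like score; higher means more likely to be announced."""
    return bias + sum(weights.get(term, 0.0) * n for term, n in feats.items())


def train_linear_svm(examples, epochs=100, lam=0.01, lr=0.1):
    """Minimize hinge loss plus an L2 penalty by subgradient descent.

    examples: list of (features, label), label +1 (announced) or -1 (not).
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in examples:
            margin = label * score(weights, bias, feats)
            for term in weights:          # shrinkage from the L2 penalty
                weights[term] *= 1.0 - lr * lam
            if margin < 1.0:              # example violates the margin
                for term, n in feats.items():
                    weights[term] = weights.get(term, 0.0) + lr * label * n
                bias += lr * label
    return weights, bias


# Made-up earlier decisions of a hypothetical monetary economics editor.
past = [
    ("monetary policy inflation", +1),
    ("central bank interest rates", +1),
    ("labour market wages", -1),
    ("trade tariffs exports", -1),
]
weights, bias = train_linear_svm([(featurize(t), y) for t, y in past])

# Pre-sort a new nep-all issue: most promising papers first.
new_titles = ["wages and exports", "inflation and interest rates"]
ranked = sorted(new_titles,
                key=lambda t: score(weights, bias, featurize(t)),
                reverse=True)
```

The output of interest is the ordering in `ranked`, not a yes/no decision, so the result is a pre-sorted issue for the editor to work through.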

Page 40

performance criteria

• We are not aware of performance criteria for the sorting of papers in a report.

• Precision and recall appear useless.

• Expected search length and average search length don't appear very attractive.

• Thus research into precise criteria is required.

Page 41

SVM performance

• If we use average search length, we can do performance evaluations.

• It turns out that reports have very different forecastability. Some are almost perfect, others are weak.

• Again, this raises a few eyebrows!
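
The slides do not define average search length. One common reading, assumed here, is the mean rank position of the papers the editor actually selected within the pre-sorted list, so that lower values mean a better forecast. The function name and the handles are illustrative.

```python
def average_search_length(ranked, selected):
    """Mean 1-based rank of the selected papers in the pre-sorted list.

    ASSUMPTION: this is one reading of "average search length"; the slides
    do not spell out the definition. Returns None if nothing was selected.
    """
    positions = [rank for rank, handle in enumerate(ranked, start=1)
                 if handle in selected]
    if not positions:
        return None
    return sum(positions) / len(positions)


# Hypothetical pre-sorted issue of four papers.
ranked = ["p1", "p2", "p3", "p4"]
perfect = average_search_length(ranked, {"p1", "p2"})  # selected papers on top
weaker = average_search_length(ranked, {"p2", "p4"})   # selected papers lower down
```

Here `perfect` is 1.5, the smallest value possible for two selected papers, and `weaker` is 3.0.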

Page 42

what is the value of an editor?

• If the forecast is perfect, we don't need the editor.

• If the forecast is very weak the editor may be a prankster.

Page 43

pre-sorting reconceived

• We should not think of pre-sorting via SVM as something to replace the editor.

• We should not think of it as encouraging editors to be lazy.

• Instead, we should think of it as an invitation to examine some papers more closely than others.

Page 44

headline vs. bottomline data

• The editors really have a three-stage decision process.
  – They read the title and author names.
  – They read the abstract.
  – They read the full text.

• A lot of papers fail at the first hurdle.

• SVM can read the abstract and prioritize papers for abstract reading.

• Editors are happy with the pre-sorting system.

Page 45

[email protected]
http://openlib.org/home/krichel/

Thank you for your attention!

