Metadata characteristics as predictors for editor selectivity in a current awareness service


Thomas Krichel & Nisa Bakkalbasi

2005-10-31

outline

• Background to work that we did
  – RePEc (Research Papers in Economics)
  – NEP: New Economics Papers

• The research
  – Theory
  – Method
  – Results

• Other work done for NEP.

RePEc

• Digital library for academic Economics. It collects descriptions of
  – economics documents (working papers, articles etc.)
  – collections of those documents
  – economists
  – collections of economists

• Pioneering effort to create a relational dataset describing an academic discipline as a whole.

• The data is freely available.

RePEc principle

• Many archives

– Archives offer metadata about digital objects or authors and institutions data.

• One database

• Many services

– Users can access the data through many interfaces.

– Providers of archives offer their data to all interfaces at the same time. This provides for an optimal distribution.

it's the incentives, stupid

• RePEc applies the ideas of open source to the construction of a bibliographic dataset. It provides an open library.

• The entire system is constructed in such a way as to be sustainable without monetary exchange between participants.

some history

• In the early 1990s, Thomas Krichel dreamed about a current awareness service for working papers. It would later have electronic papers.

• In 1993 he made the first economics working paper available online.

• In 1997 he wrote the key protocols that “govern” RePEc.

RePEc is based on 500+ archives

• WoPEc
• EconWPA
• DEGREE
• S-WoPEc
• NBER
• CEPR
• Elsevier
• US Fed in Print
• IMF
• OECD
• MIT
• University of Surrey
• CO PAH
• Blackwell

to form a 340+k item dataset

161,000 working papers

180,000 journal articles

1,300 software components

1,200 book and chapter listings

8,000 author contact & publication listings

9,100 institutional contact listings

more records than arXiv.org

RePEc is used in many services

• EconPapers

• NEP: New Economics Papers

• Inomics

• RePEc author service

• Z39.50 service by the DEGREE partners

• IDEAS

• RuPEc

• EDIRC

• LogEc

• CitEc

NEP: New Economics Papers

• This is a set of current awareness reports on new additions to the working paper stock only. Journal articles would be too old.

• Founded by Thomas Krichel in 1998.

• Supported by the Economics department at WUStL.

• Initial software was written by Jose Manuel Barrueco Cruz.

• First general editor was John S. Irons.

why NEP

• Public aim: Current awareness, if well done, can be an important service in its own right. It is sheltered from the competition of general search engines.

• Private aim: It is useful to have some classification information, even if it is limited. This should be useful in performance measures within subject areas.

modus operandi: stage 1

• The general editor uses a computer program that gathers all the new additions to the working paper stock. This is usually done weekly.

• S/he filters out new descriptions of old papers, using
  – the date field
  – handle heuristics

• The result is an issue of the nep-all report.

modus operandi: stage 2

• Editors consider the papers in the nep-all report and pick out the papers that belong to their subject. This forms an issue of a subject report nep-???.

• nep-all and the subject reports are circulated via email.

• A special arrangement makes the data of NEP available to other RePEc services.

some numbers

• There are now 60+ NEP lists.

• Over 37k subscriptions.

• Close to 16k subscribers.

• Over 50k papers announced.

• Over 100k announcements.

• Homepage at http://nep.repec.org

All this is a fantastic success!!

problem with the private aim

• We would need to classify all the papers, not only the working papers.

• We would need to have 100% coverage of NEP.

• This means every paper in nep-all appears in at least one subject report.

coverage ratio

• We call the coverage ratio the share of papers in nep-all that have been announced in at least one subject report.

• We can define this ratio
  – for each nep-all issue
  – for a subset of nep-all issues
  – for NEP as a whole
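As a rough illustration of the definition, the ratio can be computed per issue or over several issues with the same function. This is only a sketch: the issue dates, paper handles (p1, p2, ...) and report contents below are made-up placeholders, not NEP data.

```python
from typing import Iterable, List, Set


def coverage_ratio(nep_all_issues: Iterable[List[str]], announced: Set[str]) -> float:
    """Share of nep-all papers that appear in at least one subject report."""
    papers = [handle for issue in nep_all_issues for handle in issue]
    if not papers:
        raise ValueError("no papers in the selected nep-all issues")
    covered = sum(1 for handle in papers if handle in announced)
    return covered / len(papers)


# Hypothetical data: two nep-all issues and two subject reports.
issues = {
    "2005-10-01": ["p1", "p2", "p3", "p4"],
    "2005-10-08": ["p5", "p6"],
}
subject_reports = {
    "nep-mac": {"p1", "p2"},
    "nep-lab": {"p2", "p5", "p6"},
}

# A paper counts as covered if any subject report announced it.
announced = set().union(*subject_reports.values())

# Ratio for each nep-all issue, then for all issues taken together.
per_issue = {date: coverage_ratio([papers], announced) for date, papers in issues.items()}
overall = coverage_ratio(issues.values(), announced)
```

Passing a single issue, a subset of issues, or every issue to the same function gives the three variants listed above.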

coverage ratio theory & evidence

• Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase.

• However, the evidence, from research by Barrueco Cruz, Krichel and Trinidad, is
  – The coverage ratio of different nep-all issues varies a great deal.
  – Overall, it remains at around 70%.

• We need some theory as to why.

two theories

• Target-size theory

• Quality theory
  – descriptive quality
  – substantive quality

theory 1: target size theory

• When editors compose a report issue, they have a size of the issue in mind.

• If the nep-all issue is large, editors will take a narrow interpretation of the report subject.

• If the nep-all issue is small, editors will take a wide interpretation of the report subject.

target size theory & static coverage

• There are two things going on
  – The opening of new subject reports improves the coverage ratio.
  – The expansion of RePEc implies that the size of nep-all, though varying in the short run, grows in the long run. Target size theory implies that the coverage ratio deteriorates.

• The static coverage ratio that we observe is the result of both effects canceling out.

theory 2: quality theory

• George W. Bush version of quality theory
  – Some papers are rubbish. They will not get announced.
  – The amount of rubbish in RePEc remains constant.
  – This implies constant coverage.

• Reality is slightly more subtle.

two versions of quality theory

• Descriptive quality theory: papers that are badly described
  – misleading titles
  – no abstract
  – languages other than English

• Substantive quality theory: papers that are well described, but not good
  – from unknown authors
  – issued by institutions with unenviable research reputation

practical importance

• We do care whether one or the other theory is true.
  – Target size theory implies that NEP should open more reports to achieve perfect coverage.
  – Quality theory suggests that opening more reports will have little to no impact on coverage.

• Since operating more reports is costly, there should be an optimal number of reports.

overall model

• We need an overall model that explains subject editors' behavior.

• We can feed this model with variables that represent theoretical determinants of behavior.

• We can then assess the strength of various factors empirically.

method

• The dependent variable is announced. It is 1 if the paper has been announced, 0 otherwise.

• Since we are explaining a binary variable, we can use binary logistic regression analysis (BLRA). This is a fairly flexible technique, useful when the probability distributions governing the independent variables are not well known.

• That's why BLRA is popular in the life sciences.

independent variables: size

• size is the size of the nep-all issue in which the paper appeared.

• This is the critical indicator of target size theory. We expect it to have a negative impact on announced.

independent variables: position

• position is the position of the paper in the nep-all issue.

• The presence of this variable can be justified by the combined assumption of target size and editor myopia.

• If editors are myopic, they will be more liberal at the start of nep-all than at the end of nep-all.

independent variables: title

• title is the length of a title of the paper, measured by the number of characters.

• This variable is motivated by descriptive quality theory. A longer title will say more about the paper than a short title. This makes it less likely that a paper is overlooked.

independent variables: abstract

• abstract is the presence/absence of an abstract to the paper.

• This is also motivated by descriptive quality theory.

• Note that we do not use the length of the abstract because that would be a highly skewed variable.

independent variables: language

• language is an indicator of whether the metadata is in English or not.

• This variable is motivated by descriptive quality theory and the idea that English is the most commonly understood language.

• While there are a lot of multilingual editors, customizing this variable would have been rather hard.

independent variables: series

• series is the size of the series in which a paper appears.

• This variable is motivated by substantive quality theory.

• The larger a series is, the higher, usually, is its reputation. We can roughly classify series by size and quality
  – multi-institution series (NBER, CEPR)
  – large departments
  – small departments

independent variables: author

• author is the prolificacy of the authors of the paper.

• It is justified by substantive quality theory.

• This is the most difficult variable to measure. We use the number of papers written by the most prolific registered author of the paper.

• Since about 50% of the papers have no registered author, a lot of them are excluded. But there should be no bias by the exclusion.

create categorical variables

• size_1 [179, 326)

• size_2 [326, 835]

• title_1 [55, 77)

• title_2 [77, 1945]

• position_1 [0.357, 0.704)

• position_2 [0.704, 1.000]

• series_1 [98, 231)

• series_2 [231, 3654]
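The intervals above can be read as two indicator (dummy) variables per continuous measure. A minimal sketch of that encoding follows. It assumes that values below the first cut point form the reference category, and that the upper bounds on the slide (835, 1945, 1.000, 3654) are simply the largest observed values; neither assumption is stated on the slides. The names CUTS and dummies are illustrative only.

```python
from typing import Dict, Tuple

# Cut points copied from the slide: (start of category 1, start of category 2).
CUTS: Dict[str, Tuple[float, float]] = {
    "size": (179, 326),
    "title": (55, 77),
    "position": (0.357, 0.704),
    "series": (98, 231),
}


def dummies(name: str, value: float) -> Dict[str, int]:
    """Return the two indicator variables for one continuous measure.

    Assumption: values below the first cut point are the reference
    category, so both indicators are 0 for them.
    """
    low, high = CUTS[name]
    return {
        f"{name}_1": int(low <= value < high),  # half-open interval [low, high)
        f"{name}_2": int(value >= high),        # closed at the sample maximum
    }
```

For example, a paper in a nep-all issue of 200 papers would get size_1 = 1 and size_2 = 0, while a title of 40 characters falls into the reference category with both title indicators at 0.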

results

• P(announced=1 | x) = exp(g(x)) / (1 + exp(g(x)))

• g(x) = 0.2401 - 0.2774*size_1 - 0.4657*size_2 + 0.1512*title_1 + 0.2469*title_2 + 0.3874*abstract + 0.0001*author + 0.7667*language - 0.1159*series_1 + 0.1958*series_2

• position is not significant. author just makes the cut.
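To make the fitted model concrete, the sketch below evaluates g(x) and the announcement probability for a given paper, using the coefficients from the slide. The example paper is hypothetical, and treating missing variables as 0 is a convenience of this sketch rather than part of the original analysis.

```python
import math
from typing import Dict

# Intercept and coefficients copied from the slide.
INTERCEPT = 0.2401
COEFFICIENTS: Dict[str, float] = {
    "size_1": -0.2774,
    "size_2": -0.4657,
    "title_1": 0.1512,
    "title_2": 0.2469,
    "abstract": 0.3874,
    "author": 0.0001,
    "language": 0.7667,
    "series_1": -0.1159,
    "series_2": 0.1958,
}


def g(x: Dict[str, float]) -> float:
    """Linear predictor; variables missing from x are treated as 0."""
    return INTERCEPT + sum(b * x.get(name, 0.0) for name, b in COEFFICIENTS.items())


def p_announced(x: Dict[str, float]) -> float:
    """P(announced=1 | x); 1 / (1 + exp(-g)) equals exp(g) / (1 + exp(g))."""
    return 1.0 / (1.0 + math.exp(-g(x)))


# Hypothetical paper: English metadata, abstract present, long title,
# announced in a large nep-all issue.
paper = {"size_2": 1, "title_2": 1, "abstract": 1, "language": 1}
probability = p_announced(paper)
```

Comparing p_announced for papers that differ in a single variable shows the direction of each effect: adding an abstract raises the probability, while moving to a larger nep-all issue lowers it.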

odds ratio

• size_1 1.32 [1.22, 1.44]

• size_2 0.83 [0.76, 0.90]

• title_1 1.16 [1.07, 1.26]

• title_2 1.28 [1.18, 1.39]

• abstract 1.47 [1.34, 1.62]

• language 2.15 [1.85, 2.51]

• series_1 1.11 [1.02, 1.20]

• series_2 1.37 [1.26, 1.49]

• author 1.05 [1.01, 1.09]
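In a logistic regression the odds ratio of a variable is the exponential of its coefficient. The short check below applies this to the title, abstract and language coefficients from the results slide, where the rule reproduces the tabulated values to two decimals. The size, series and author rows do not follow from exponentiating the listed coefficients directly, presumably because of a different reference category or scaling that the slides do not spell out, so they are left out of this sketch.

```python
import math

# Coefficients from the results slide for the variables checked here.
coefficients = {
    "title_1": 0.1512,
    "title_2": 0.2469,
    "abstract": 0.3874,
    "language": 0.7667,
}

# Odds ratio = exp(coefficient), rounded to two decimals as in the table.
odds_ratios = {name: round(math.exp(b), 2) for name, b in coefficients.items()}
```

Read this way, English-language metadata roughly doubles the odds of a paper being announced, holding the other variables fixed.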

scandal!

• Substantive quality theory cannot be rejected. That means that the editors are selecting for quality as well as for the subject.

• The editors have rejected our findings. Almost all protest that there is no quality filtering.

consequences

• There has been no program to expand the list of reports.

• There has to be a concerted effort to help editors find subject-specific papers. This can be done by
  – the use of a more efficient interface
  – the use of automated resource discovery methods.

ernad

• ernad stands for "editing reports on new academic documents". It is a purpose-built software system for current awareness reports.

• It has been designed by Thomas Krichel, http://openlib.org/home/krichel/work/altai.html

• The system was written by Roman D. Shapiro.

statistical learning

• The idea is that a computer may be able to make decisions on the current nep-all report based on the observation of earlier editorial decisions.

• ernad now works using support vector machines (SVM), with titles, abstracts, author name, classification values and series as features.
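The idea of learning from earlier editorial decisions can be sketched with a very small linear SVM. This is not the ernad implementation: it is a hand-written subgradient descent on the regularised hinge loss, it uses title words as the only features, it has no bias term, and the training titles are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def features(text: str) -> Dict[str, float]:
    """Bag-of-words features: each lower-cased word counts once."""
    return {word: 1.0 for word in text.lower().split()}


def train_linear_svm(examples: List[Tuple[str, int]], epochs: int = 50,
                     rate: float = 0.1, reg: float = 0.01) -> Dict[str, float]:
    """Subgradient descent on the L2-regularised hinge loss; labels are +1 or -1."""
    weights: Dict[str, float] = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            x = features(text)
            margin = label * sum(weights[w] * v for w, v in x.items())
            for w in list(weights):
                weights[w] *= 1.0 - rate * reg  # shrink step from the L2 penalty
            if margin < 1.0:  # hinge loss is active for this example
                for w, v in x.items():
                    weights[w] += rate * label * v
    return dict(weights)


def score(weights: Dict[str, float], text: str) -> float:
    """Signed distance-like score; higher means more likely to be selected."""
    return sum(weights.get(w, 0.0) * v for w, v in features(text).items())


# Hypothetical past decisions of one subject editor (+1 announced, -1 skipped).
past = [
    ("minimum wage and employment", +1),
    ("unemployment duration and job search", +1),
    ("monetary policy rules", -1),
    ("exchange rate and inflation targeting", -1),
]
weights = train_linear_svm(past)

# Pre-sort a new nep-all issue: most promising papers first.
new_issue = ["inflation and monetary shocks", "wage bargaining and job creation"]
ranked = sorted(new_issue, key=lambda title: score(weights, title), reverse=True)
```

The output is an ordering of the issue rather than a yes/no decision, which matches the use of the SVM for pre-sorting.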

performance criteria

• We are not aware of performance criteria for the sorting of papers in a report.

• Precision and recall appear useless.

• Expected search length and average search length don't appear very attractive.

• Thus research into precise criteria is required.

SVM performance

• If we use average search length, we can do performance evaluations.
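The slides do not define average search length. The sketch below assumes a common reading: the mean position, in the pre-sorted list, of the papers that the editor ended up selecting, so that lower values mean a better pre-sort. The function name and the paper labels are illustrative only.

```python
from typing import List, Set


def average_search_length(ranking: List[str], selected: Set[str]) -> float:
    """Mean 1-based rank of the selected papers in the pre-sorted list."""
    positions = [rank for rank, paper in enumerate(ranking, start=1) if paper in selected]
    if not positions:
        raise ValueError("no selected paper appears in the ranking")
    return sum(positions) / len(positions)


# Hypothetical pre-sorted issue of five papers.
ranking = ["a", "b", "c", "d", "e"]

# Best case: the editor's two picks were ranked first and second.
best = average_search_length(ranking, {"a", "b"})

# Worst case: the editor's two picks were ranked last.
worst = average_search_length(ranking, {"d", "e"})
```

Under this assumed definition, comparing a report's value with its best and worst possible values gives a sense of how forecastable that report is.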

• It turns out that reports have very different forecastability. Some are almost perfect, others are weak.

• Again, this raises a few eyebrows!

what is the value of an editor?

• If the forecast is perfect, we don't need the editor.

• If the forecast is very weak the editor may be a prankster.

pre-sorting reconceived

• We should not think of pre-sorting via SVM as something to replace the editor.

• We should not think of it as encouraging editors to be lazy.

• Instead, we should think of it as an invitation to examine some papers more closely than others.

headline vs. bottomline data

• The editors really have a three-stage decision process.
  – They read title, author names.
  – They read the abstract.
  – They read the full text.

• A lot of papers fail at the first hurdle.

• SVM can read the abstract and prioritize papers for abstract reading.

• Editors are happy with the pre-sorting system.

http://openlib.org/home/krichel/

Thank you for your attention!
