Free author registration Thomas Krichel LIU & НГУ 2008-12-11.

Post on 27-Mar-2015

free author registration

Thomas Krichel, LIU & НГУ

2008-12-11

me today

• I work for the Palmer School of Library and Information Science in the College of Information and Computer Science at the C.W. Post Campus of Long Island University in Brookville, NY, U.S.A., and for the Division of Information Systems in the Faculty of Information Technology at Novosibirsk State University in Novosibirsk, Russia.

• I do a lot of programming & sysadmin.

formerly

• I am a trained economist.

• My main claim to fame is the creation and coordination of the RePEc digital library for economics at http://repec.org.

• My main area of work within RePEc is the NEP: New Economics Papers current awareness service. It's a totally different topic.

RePEc now

• It is a collection of data about academic economics.

• The bulk of the data is data about documents.

• And the bulk of that is
– published article data
– working paper data

• But the interesting data is the author, institution and usage data.

RePEc principle of 1997

• many archives
– archives offer metadata about digital objects (mainly working papers & journal articles)

• one database
– the data from all archives forms one single logical database

• many services
– users can access the data through many services
– providers of archives offer their data to all services

RePEc is based on 900+ archives

• Blackwell
• MPRA
• DEGREE
• S-WoPEc
• NBER
• CEPR
• Taylor & Francis
• US Fed in Print
• IMF
• OECD
• MIT
• University of Surrey
• CO PAH
• Elsevier

to form a 630k item dataset

254,000 working papers

370,000 journal articles

1,600 software components

4,200 book and chapter listings

17,600 author records

10,800 institutional contact listings

RePEc is used in many services

• EconPapers

• NEP: new economics papers

• Google Scholar
• RePEc Author Service
• Twitter bulk posting (planned)
• LogEc
• IDEAS
• RuPEc
• EDIRC
• CitEc
• MPRA

… describes documents

template-type: redif-paper 1.0
title: dynamic aspect of growth and fiscal policy
author-name: thomas krichel
author-person: repec:per:1965-06-05:thomas_krichel
author-email: t.krichel@surrey.ac.uk
author-name: paul levine
author-email: p.levine@surrey.ac.uk
author-workplace-name: university of surrey
classification-jel: c61; e21; e23; e62; o41
file-url: ftp://www.econ.surrey.ac.uk/pub/repec/sur/surrec/surrec9601.pdf
file-format: application/pdf
creation-date: 199603
revision-date: 199711
handle: repec:sur:surrec:9601

… describes persons (ras)

template-type: redif-person 1.0
name-full: mankiw, n. gregory
name-last: mankiw
name-first: n. gregory
handle: repec:per:1984-06-16:n__gregory_mankiw
email: ngmankiw@harvard.edu
homepage: http://post.economics.harvard.edu/faculty/mankiw/mankiw.html
workplace-institution: repec:edi:deharus
workplace-institution: repec:edi:nberrus
author-article: repec:aea:aecrev:v:76:y:1986:i:4:p:676-91
author-article: repec:aea:aecrev:v:77:y:1987:i:3:p:358-74
author-article: repec:aea:aecrev:v:78:y:1988:i:2:p:173-77
….

… describes institutions

template-type: redif-institution 1.0
primary-name: university of surrey
primary-location: guildford
secondary-name: department of economics
secondary-phone: (01483) 259380
secondary-email: economics@surrey.ac.uk
secondary-fax: (01483) 259548
secondary-postal: guildford, surrey gu2 5xh
secondary-homepage: http://www.econ.surrey.ac.uk/
handle: repec:edi:desuruk
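The templates shown on these slides are lists of "field: value" lines in which a field may repeat. A minimal sketch of reading one such template in Python follows. It assumes one field per line and ignores the rest of the ReDIF rules (continuation lines, grouping of author fields), so it is an illustration and not the parser RePEc services use. The function name parse_template is made up for this example.

```python
from collections import defaultdict

def parse_template(text):
    """Parse one 'field: value' template into a dict of lists (fields may repeat)."""
    record = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only: handles and URLs contain colons themselves.
        field, value = line.split(":", 1)
        record[field.strip().lower()].append(value.strip())
    return dict(record)

# A shortened copy of the person template from the slide.
SAMPLE = """\
template-type: redif-person 1.0
name-full: mankiw, n. gregory
handle: repec:per:1984-06-16:n__gregory_mankiw
workplace-institution: repec:edi:deharus
workplace-institution: repec:edi:nberrus
"""

person = parse_template(SAMPLE)
print(person["handle"][0])
print(person["workplace-institution"])
```

Keeping every value in a list means repeated fields such as workplace-institution or author-article need no special handling.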

author registration

• It started when JISC funding allowed us to hire a student to write an author registration system.

• The system went online as “HoPEc” in late 2000.

• It has been renamed “RePEc Author Service” (RAS).

• A 2002 grant from OSI allowed for a rewrite and expansion.

researcherID

• researcherID is a system by Thomson ISI. It allows authors to find their documents.

• It has been modeled after the RePEc author service.

• But the document and personal records are not freely available.

success of RAS

• Measuring the success of an author registration service is difficult in general.

• In RePEc we are fortunate that an independent list of top 1000 authors exists.

• Of those, 80% are registered.

author registration ?

• Author registration is not disambiguation of names.

• Author registration is not authority control.

• Author registration is usually done by authors themselves. It involves two steps
– Registrants put in some personal data.
– Registrants find, in the document data, records about documents they have written.

personal data

• This contains required elements
– person's name
– email

• and optional elements
– institutional affiliation
– homepage URL

search for authorships

• This is based on a set of name variations.

• A name variation is a string by which document metadata may have referred to the registrant as an author.

• Example:
– Thomas Krichel
– Крихель, Т.

• Registrants maintain a name variations profile.
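To make the search step concrete, here is a small sketch of matching a name variations profile against the author strings of document records. It is only an assumption about how such a search could work: the normalisation rule, the function names and the sample records are invented for the example, and only exact matches after normalisation are found.

```python
import unicodedata

def normalize(name):
    """Case-fold, turn punctuation into spaces and collapse whitespace."""
    name = unicodedata.normalize("NFKC", name).casefold()
    kept = [ch if ch.isalnum() else " " for ch in name]
    return " ".join("".join(kept).split())

def suggest_documents(variations, documents):
    """Return documents with at least one author string equal to a name variation."""
    wanted = {normalize(v) for v in variations}
    return [
        doc for doc in documents
        if any(normalize(author) in wanted for author in doc["authors"])
    ]

variations = ["Thomas Krichel", "Крихель, Т."]
# Hypothetical document records, reduced to a title and author strings.
documents = [
    {"title": "doc A", "authors": ["thomas krichel", "Paul Levine"]},
    {"title": "doc B", "authors": ["Крихель, Т."]},
    {"title": "doc C", "authors": ["N. Gregory Mankiw"]},
]
print([d["title"] for d in suggest_documents(variations, documents)])
# prints ['doc A', 'doc B']
```

The normalisation keeps non-Latin letters, so a Cyrillic variation is matched in the same way as a Latin one.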

authors

• An author is a registrant who has at least one work claim.

• Since author registration is a pioneering innovation by yours truly, its purpose is not yet clearly understood.

• A user who registers to gain access to data is called a bozo registrant.

• RAS managers periodically clear presumed bozo registrants.

free? as in $0

• Registrants don't pay money for registration.

• Document data providers don't pay to have their document data listed.

• Registrants' data is freely available if they allow it.

free ? as in freedom

• Author records are freely available for any purpose, as long as we have the registrants' consent.

• Registrants' consent is assumed for anything but the email address. By default email addresses are not exported.

freedom is crucial

• Users register with the intention that their records will be used.

• They will prefer a system that has high re-usage.

• Therefore I am confident an open system will win over a closed system.

free document data

• In principle, document data has to contain only three fields
– Title
– Author name expressions
– URL for further information and/or

• Such data is in principle not copyrightable. But there are still only a few sources that have such data readily available.

service implementation scale

• Registration of authors can be conducted against any document dataset.

• What is the appropriate set
– type scale?
– subject scale?

• RAS shows it works at the scale of a single discipline with research paper documents, both articles and working papers.

• But economics is fairly insular.

AuthorClaim.org

• Since 2008 yours truly has been working on an interdisciplinary system.

• This will be the last important project before my death.

• The idea is that it will help the fledgling institutional repository (IR) movement.

• Since IRs currently are either empty or contain rubbish, AuthorClaim has to be primed with other contents.

datasets

• The data used in AuthorClaim are
– PubMed (problematic)
– DBLP (XML file only)
– CiteSeer
– arXiv (not announced yet)
– CIS (non-free dataset)
– E-LIS

• Work is under way to include a broad range of the repositories listed in DOAR.

PubMed

• The 800 pound gorilla of bibliographic datasets, with 17 million records.

• Free only as $0, through a convoluted license.

• In addition, NLM added the condition that I would not offer the personal records to them. Just saying that they would refuse them if I offered them was not enough for them.

DBLP

• Not freely available either.
– only an XML dump of some records (individual documents)
– only for non-commercial purposes

• Overlap with CiteSeer would be nice to clean up.

CIS

• This is the Current Index to Statistics.

• Not a free dataset at all, but yours truly has access to a database version from which to extract the 3 metadata fields that are required.

DOAR repositories

• DOAR repositories use the OAI-PMH protocol. When harvesting fails, dirty UTF-8/XML seems to be the main culprit.

• Roughly, out of 1200 registered repositories, about half work on any particular day.

• For roughly two thirds we can get some records by trying and stopping when the first error occurs.

• BTW RePEc makes for the second-largest DOAR repository by record number.
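The "try and stop when the first error occurs" approach can be sketched as below. This is an illustration under assumptions and not the harvester actually used: the function names and the example URL are invented. The demo replaces the network with a fake fetcher whose second page is broken, so the sketch runs offline. The ListRecords verb, the metadataPrefix and resumptionToken arguments and the XML namespace are those of OAI-PMH 2.0.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def fetch_url(url):
    """Default fetcher: return the raw response bytes."""
    with urlopen(url, timeout=30) as response:
        return response.read()

def harvest_until_error(base_url, fetch=fetch_url, prefix="oai_dc"):
    """Collect ListRecords results page by page, stopping at the first error."""
    records = []
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    while True:
        try:
            root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        except (ET.ParseError, OSError, ValueError):
            break  # dirty XML/UTF-8 or a network failure: keep what we have
        if root.find(OAI + "error") is not None:
            break  # the repository answered with a protocol error
        records.extend(root.iter(OAI + "record"))
        token = root.find(OAI + "ListRecords/" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no more pages
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
    return records

# Offline demo: the first page is fine, the second page is truncated XML.
PAGE_1 = (b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>'
          b'<record/><record/><resumptionToken>t1</resumptionToken>'
          b'</ListRecords></OAI-PMH>')
PAGE_2 = b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords><record>'

def fake_fetch(url):
    return PAGE_2 if "resumptionToken=t1" in url else PAGE_1

harvested = harvest_until_error("http://repository.example/oai", fetch=fake_fetch)
print(len(harvested))  # prints 2: the records of the first page survive
```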

subject coverage and overlap

• The subject coverage of AuthorClaim will remain uneven unless publishers are giving data directly (replacing libraries, eventually).

• Overlap is less of a problem than lack of good data. RePEc routinely groups various versions of authors' work together. This is feasible if they are in the claimed set of a person.

scaling issue

• With 30 times the number of records, and with PubMed only using initials (phew!), registrants with common names have large sets of potential documents to work through.

• Clearly they also derive more benefits.

• Example: Joanna P. Davies currently has 795 proposed documents. Now think about Chen or Li.

machine learning

• In a new project Илья Королёв and Thomas Krichel are working on enhancing ACIS to provide help through machine learning.

• The idea is that the users will submit a few positive and negative examples, and machine learning sorts the most likely authored documents to the front. The assessment of such a system is really interesting.
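A toy version of that idea, to fix what "sorts the most likely authored documents to the front" could mean. This is not the method of the project described above. It swaps in a simple word-level log-odds score over titles, and all function names and sample titles are invented for the illustration.

```python
import math
import re
from collections import Counter

def tokens(title):
    """Set of lowercased words in a title."""
    return set(re.findall(r"\w+", title.lower()))

def rank_candidates(positives, negatives, candidates):
    """Sort candidate titles so that the most likely authored ones come first."""
    pos = Counter(t for title in positives for t in tokens(title))
    neg = Counter(t for title in negatives for t in tokens(title))

    def score(title):
        # Sum of smoothed log-odds per word: above 0 leans "mine", below 0 leans "not mine".
        return sum(math.log((pos[t] + 1) / (neg[t] + 1)) for t in tokens(title))

    return sorted(candidates, key=score, reverse=True)

accepted = ["fiscal policy and growth", "growth in open economies"]  # positive examples
refused = ["protein folding in yeast", "yeast gene expression"]      # negative examples
proposed = ["yeast protein kinetics", "monetary policy and growth", "a survey"]
print(rank_candidates(accepted, refused, proposed))
# prints ['monetary policy and growth', 'a survey', 'yeast protein kinetics']
```

A real system would presumably also use co-authors, venues and dates. The sketch only shows the ranking interface: a few examples in, a sorted list of proposals out.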

ACIS

• This is the Academic Contribution Information System.

• It is generic software that enables author registration services, and services that are somewhat more general.

• Work on ACIS was sponsored by the Open Society Institute.

• The software was written by Ivan V. Kurmanov. It is verrrry complicated.

basic idea

• A contribution is a relationship between document data records and personal records that a registrant can claim.

• Authorship and editorship are built-in contribution types, but others can be configured.

• The contribution system allows registrants to provide information about their contribution.

no document creation

• Using ACIS, registrants cannot create document records.

• While many RAS registrants want to do this, it is considered out of scope for an ACIS installation.

• ACIS-based systems are not supposed to substitute for the work of publishers but to complement it.

ACIS implementations and document services

• An ACIS implementation service (AIS) can work with a document submission service (DSS).

• A DSS would typically run EPrints, DSpace or Fedora-Commons.

• While such systems are distinct, on different machines etc, they can be so interconnected that they appear integrated to a naive user.

interoperability

• AIS and DSS interoperability comes in different levels.

• With each level up, we have more (better) interoperability.

• We have levels 0 to 4.

• At level zero, an AIS and a DSS simply live side by side, and no interaction is happening.

level 1

• In level 1, a DSS provides metadata about its documents to an AIS.
– The data is stored in files
– in a compatible format. For ACIS this would be AMF or ReDIF.

• The AIS processes the data periodically. It
– adds new records to the document data set
– performs probationary associations between documents and authors

level 2

• A DSS delivers to the AIS data for some of its authorships that point to data in the AIS. The AIS can accept any of the following 3 identification avenues
– an identifier known to the AIS
– a shortID, previously generated by the AIS
– an email address, known to the AIS as the login of a registrant.

• This data will have to be entered by a submitter.
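The three identification avenues can be pictured as one lookup on the AIS side. The sketch below is an assumption about how such a lookup could be written, not ACIS code: the function name, the dictionary keys and the shortID value "pkr1" are hypothetical. The identifier and email address are taken from the document template shown earlier.

```python
def resolve_registrant(reference, registrants):
    """Return the registrant matching an authorship reference, or None.

    `reference` may carry any of the keys 'identifier', 'shortid', 'email'.
    """
    for key in ("identifier", "shortid", "email"):
        value = reference.get(key)
        if not value:
            continue
        if key == "email":
            value = value.strip().lower()  # logins are compared case-insensitively here
        for registrant in registrants:
            if registrant.get(key) == value:
                return registrant
    return None

registrants = [
    {"identifier": "repec:per:1965-06-05:thomas_krichel",
     "shortid": "pkr1",  # hypothetical shortID
     "email": "t.krichel@surrey.ac.uk"},
]

by_email = resolve_registrant({"email": "T.Krichel@surrey.ac.uk"}, registrants)
by_shortid = resolve_registrant({"shortid": "pkr1"}, registrants)
unknown = resolve_registrant({"email": "nobody@example.org"}, registrants)
print(by_email is by_shortid, unknown)  # prints True None
```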

level 3

• The DSS helps submitters to find the data required for level 2 interoperability.

• While submitters enter authorship data, the DSS performs searches in the AIS data. If matching records are found, the submitter is invited to select them.

• The document data is then exported to the AIS in the usual way.

implementing level 3

• The AIS needs to expose registrants' data to the DSS. The data cannot be made available publicly if we want the email to be an avenue of identification.

• The DSS must search the AIS data, display the matches as options in an unobtrusive way, and give submitters an easy way to choose an option.

level 4

• The DSS immediately notifies the AIS about a document submission.

• The AIS processes the notification and adds the document to the research profiles of its identified authors.

level dependency

• There is level dependency
– level 1 is really required for the other levels.
– level 2 is a basis for level 3.
– level 4 can be done without either level 2 or level 3.

• Current ACIS code can implement all four levels.

• There is code written for EPrints 2.0 that implements the DSS side of the interoperability.
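The dependency rules can be written down as a small table and checked mechanically. This is only a restatement of the slide in code. The names REQUIRES and missing_levels are made up, and ACIS itself is not claimed to contain such a check.

```python
# Levels each interoperability level depends on, as stated on the slide.
REQUIRES = {1: set(), 2: {1}, 3: {1, 2}, 4: {1}}

def missing_levels(enabled):
    """Return the levels that the enabled set depends on but does not include."""
    needed = set()
    for level in enabled:
        needed |= REQUIRES[level]
    return sorted(needed - set(enabled))

print(missing_levels({1, 4}))  # prints []: level 4 needs neither level 2 nor level 3
print(missing_levels({3}))     # prints [1, 2]
```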

ACIS components

• “rid” is a feeding daemon. It feeds records in files into a processor. It uses the Berkeley DB transactional database system.

• “ARDB” is a software suite that implements relational bibliographic datasets.

• There is a general web application layer. It fires up XSLT.

ACIS components, a few more

• A “shortID” system associates shortIDs with documents and, more importantly, registrants.

• A “userData” system manages the data handled by users and feeds it back to the ARDB system.

• A “resources” system deals with searches and suggestions.

ACIS functionality

• Besides the association of documents with users, ACIS provides a range of functions that complement or extend the basic functionality.

• I will review some now.

ACIS contact details

• This is a set of trivial fields
– email. This detail is required but not exported by default.
– homepage
– phone number
– postal address

• We don't do pictures of the registrants' dogs etc.

affiliations profile

• This is more complicated.

• Institutional data is kept as separate records, not as string data.

• Registrants can search for existing institutional records to create an affiliation with.

• Or they can propose a new record to be added by filling out a form.

research profile

• This is a collection of metadata about research documents the registrant has written.

• Available functions include
– display a list of works in the profile
– search for new suggested works
– manual search for works by title
– display refused research documents
– change preferences for automatic updates

automatic updates

• By default, when a document record quotes a person's short id, the document is added to the profile.

• By default, a regular search using the name variations profile identifies a set of potential new documents and reports them to the user via email.

• The registrant may choose to have exact matches of these searches added to the research profile.
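The three rules on this slide amount to a small decision procedure for each new document record. A sketch under assumptions: the function name, the dictionary keys and the shortID "pkr1" are hypothetical, and only exact name matches are modelled, while the regular search may well propose looser matches too.

```python
def process_new_document(doc, registrant):
    """Decide what happens to a new document record for one registrant.

    Returns 'add' (goes into the research profile), 'suggest' (reported to the
    registrant by email) or 'ignore'.
    """
    if registrant["shortid"] in doc.get("author_shortids", []):
        return "add"  # the record quotes the person's short id
    variations = {v.lower() for v in registrant["name_variations"]}
    if any(author.lower() in variations for author in doc["authors"]):
        # An exact match is only added when the registrant has opted in.
        return "add" if registrant.get("auto_add_exact") else "suggest"
    return "ignore"

registrant = {"shortid": "pkr1",  # hypothetical shortID
              "name_variations": ["Thomas Krichel", "Krichel, T."],
              "auto_add_exact": False}

quoted = {"authors": ["T. Krichel"], "author_shortids": ["pkr1"]}
matched = {"authors": ["Krichel, T."]}
other = {"authors": ["Paul Levine"]}
print([process_new_document(d, registrant) for d in (quoted, matched, other)])
# prints ['add', 'suggest', 'ignore']
```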

document to document links

• Authors can create document-to-document links to say that two documents in the profile are related.

• Document full-text links can be confirmed or rejected.

• Typically such full-text files would be found by an automated search external to the AIS.

citations profile

• Within this profile, authors can partially manage citation information for items in the research profile.

• Just as a DSS may submit data to an AIS, a citation discovery service may give citation data to an AIS.

• Such data can be maintained in the citations profile.

references processing

• References are processed to see if they may correspond to a document in the research profile.

• If a document in the profile has a potential citation it is called an “interesting” document.

• Once reference processing is done, registrants can navigate by decreasing level of interest.
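A rough sketch of this processing step. It assumes that a reference string potentially cites a profile document when it contains the document's title, and that the "level of interest" is the number of such potential citations. Both are assumptions made for the illustration, as are the function names and the dummy titles and references.

```python
import re

def normalize(text):
    """Lowercase and keep only words, separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))

def interesting_documents(profile_titles, references):
    """Return (title, potential_citations) pairs, most potential citations first."""
    found = []
    for title in profile_titles:
        needle = normalize(title)
        hits = [ref for ref in references if needle in normalize(ref)]
        if hits:  # at least one potential citation: the document is "interesting"
            found.append((title, hits))
    # Assumed level of interest: the number of potential citations.
    return sorted(found, key=lambda pair: len(pair[1]), reverse=True)

profile_titles = ["First Title", "Second Title", "Third Title"]  # dummy profile
references = [  # dummy reference strings
    "Author A (2001). First title. Journal X.",
    "Author B, 2003, 'Second Title', Working Paper Y",
    "Author C (2005). Second title: Journal Z.",
    "Author D (2007). Something unrelated.",
]
for title, hits in interesting_documents(profile_titles, references):
    print(title, len(hits))
```

The registrant would then walk through the hits of each interesting document and accept or refuse them.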

suggestions processing

• Registrants navigate the set of suggested citations to see if the reference string really matches the research profile item.

• If the registrant refuses a citation, there is a screen where she can later overturn such a decision.

automatic citation updates

• If the reference is very close to citation data, the registrant can have it added automatically.

• When a co-author has identified a citation to an item in her profile, the registrant can allow it to be added automatically.

thank you for your attention!

http://openlib.org/home/krichel