Maurizio Lenzerini July 2015
EUR 27388 EN
Corroboration of the algorithm
and the technical specifications for the
European Map of Excellence
and Specialization (EMES)
EUROPEAN COMMISSION
Directorate-General for Research and Innovation
Directorate A Policy Development and coordination
Unit A6 Science Policy, foresight and data
Contact: Emanuele Barbarossa, Katarzyna Bitka
E-mail: [email protected]
European Commission B-1049 Brussels
This document is based on projects carried out by the ONTORES research group at
Sapienza University of Rome. The contributions of Alessandro Bartolucci, Cinzia Daraio,
Camil Demetrescu, Emanuele Fusco, Claudio Leporelli, Henk F. Moed, and Paolo Naggar
are warmly acknowledged.
Directorate-General for Research and Innovation
2015 Research, Innovation, and Science Policy Experts High Level Group EUR 27388 EN
LEGAL NOTICE
This document has been prepared for the European Commission; however, it reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.
More information on the European Union is available on the internet (http://europa.eu).
Luxembourg: Publications Office of the European Union, 2015.
ISBN 978-92-79-50353-5
doi 10.2777/566251
ISSN 1831-9424
European Union, 2015. Reproduction is authorised provided the source is acknowledged.
Table of contents
ABSTRACT
EXECUTIVE SUMMARY
RÉSUMÉ
INTRODUCTION
1. AN OBDM APPROACH TO DESIGN AN EMES SUSTAINABLE OVER TIME
    1.1 Difficulties in accessing and managing distributed and heterogeneous data
    1.2 What is OBDM
2. FUNDAMENTAL MODULES OF AN ONTOLOGY FOR EMES
    2.1 The module Agent
        Subject
        Agent
        Natural person
        Organization
    2.2 The module Research
        Research outputs and outcomes
        Technological transfers
    2.3 The module Publishing
        Publication
        Content
        References
    2.4 The module Space
        Space
        Place
        Territory
    2.5 The Author Affiliation View
3. TECHNICAL SPECIFICATIONS FOR THE EUROPEAN MAP OF EXCELLENCE (EME)
    3.1 Gathering data about European Universities
    3.2 Problems in the disambiguation of institution affiliations in scientific publications
        Main technical and theoretical issues
        Key elements of the proposed affiliation disambiguation approach
    3.3 What has been done in the GRBS (Global Research Benchmarking System)
    3.4 Identification of PROs: Towards an authority file for PROs
    3.5 Disambiguation of affiliations of European research institutes. Description of a successful approach and the main steps of its algorithm
    3.6 An examination of the conditions under which the bibliometric databases underlying the European Map of Excellence (EME) are available for use in the EME project: problems and options
4. ASSESSMENT
5. MAIN RECOMMENDATIONS
REFERENCES
FURTHER DOCUMENTS
APPENDICES
    Appendix 1: Technical description of the algorithm for disambiguation of affiliations of European research institutes
    Appendix 2: The Author Affiliation View of the ontology
ABSTRACT
We present a study whose goal is to deal with three problems: (1) establishing a comprehensive list of research organizations, in particular Public Research Organizations, (2) disambiguating the ways in which research organizations show up in bibliographic data, and (3) locating the scientific production of research organizations at geographic level.
We address the above issues by resorting to a so-called Ontology-Based Data Management approach, where a conceptual representation of the relevant concepts and relationships, called an ontology, is used for modelling the data of the underlying information system. Using this approach, we illustrate some principles for designing a European Map of Excellence and Specialisation that is sustainable over time.
EXECUTIVE SUMMARY
The European scientific system is formed by two large segments: universities and Public Research Organisations (PROs). These two segments represent, in fact, the bulk of scientific publications. The scientific output of universities is the object of a large effort in the dedicated literature, as well as in various types of university rankings. The European Commission itself has supported the launch of the U-Multirank exercise. In addition, the geographic location of university activity is made easier by the fact that universities are typically mono-plant organizations.
By contrast, little is known, on a systematic and comparable basis, about the scientific output of Public Research Organisations. More specifically, the following issues are still open: (a) establishing a comprehensive list of PROs; (b) disambiguating the innumerable ways in which PROs show up in bibliometric information and normalizing affiliations; (c) locating the scientific production of PROs at geographic level.
With respect to the list, it would be important to search for all European institutions beyond a given threshold of visibility in order to build up a repository.
Disambiguation is a very complex undertaking, since there are many ways in which the names of PROs appear (for example, in many cases only the name of the institute is visible, with the name of the umbrella PRO not shown, or shown as an abbreviation, or embedded within the overall name of the institute).
Finally, the geo-referencing of PRO publications is made difficult by the fact that the location of the laboratory or institute does not necessarily appear in the affiliations of the articles in bibliometric databases.
It is understood that all three elements will require a dedicated large-scale effort in the future.
The current study has been carried out as a proof of concept, needed in order to verify feasibility and to provide a basis for estimating the time and cost involved in a full-scale exercise.
The study has critically examined the technical feasibility and validity of existing methods and approaches to address the open issues recalled above.
The assessment of the feasibility has been based on the following criteria:
Availability of data on publications. It is very important to clearly define the availability of data on publications in terms of sources, commercial conditions and regular updating.
Degree of automation of the procedures. Applying the algorithm described in this study in a full-scale exercise at European level does not present computational time problems. As shown in the related sections, some work is required to improve the master list, with particular attention to its maintenance and updating over time. An effort is therefore needed to develop a data infrastructure able to improve and maintain the master list over time. The OBDM approach illustrated in this feasibility study is a very promising tool that we recommend adopting.
Operational conditions needed. As discussed at length in the study, it is important to involve universities and the branches (institutions) of PROs in checking the list of variants of all institution names found in publications. Their involvement is also essential for maintaining and updating this list, which calls for a system of incentives for the research institutions involved, to ensure their collaboration over time.
A summary of our recommendations follows.
Adopt an Ontology-Based Data Management (OBDM) approach to ensure replicability, extensions and updating of the system over time.
The availability of data on publications in terms of sources, commercial conditions and regular updating should be clearly discussed and settled. It is therefore of the utmost importance that the European Commission, informed by experts in the field, consider starting negotiations at an early stage with producers of large bibliographic databases about the conditions of use of these databases as sources in the creation and maintenance of the public information system at stake, and reach with one or more producers an appropriate license agreement covering a sufficiently long time period.
The disambiguation is feasible, as illustrated in the study presented here. The main guidelines for its successful implementation can be summarized as follows.
General principles:
Use a validated thesaurus or authority file;
Use advanced disambiguation software;
Start from the raw affiliation data; do not use ill-understood features implemented by indexers;
Consult national and institutional experts; create an annotation file with background information on how an institution is defined;
Initiate further research into this issue, and examine the use of information from funding acknowledgements.
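To make the first and third principles concrete, the following minimal sketch shows how raw affiliation strings could be matched against an authority file of name variants. It is an illustration under stated assumptions, not the algorithm described in this study: the identifiers (PRO-0001, PRO-0002), the variant lists and the function names are hypothetical, and exact phrase matching stands in for the more advanced disambiguation software recommended above.

```python
import re
import unicodedata

# Hypothetical authority file: made-up IDs mapped to known name variants.
AUTHORITY_FILE = {
    "PRO-0001": ["Consiglio Nazionale delle Ricerche", "CNR"],
    "PRO-0002": ["Centre National de la Recherche Scientifique", "CNRS"],
}


def normalize(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def build_index(authority_file):
    """Map every normalized variant to the ID of its organization."""
    index = {}
    for org_id, variants in authority_file.items():
        for variant in variants:
            index[normalize(variant)] = org_id
    return index


def match_affiliation(raw, index):
    """Return the IDs whose variants occur as whole-word phrases in raw."""
    padded = f" {normalize(raw)} "
    return {org_id for variant, org_id in index.items()
            if f" {variant} " in padded}


INDEX = build_index(AUTHORITY_FILE)
print(match_affiliation("Istituto di Chimica, CNR, Pisa, Italy", INDEX))
print(match_affiliation("Université de Lyon, CNRS, France", INDEX))
```

Whole-word matching keeps "CNR" from firing inside "CNRS", but a dotted abbreviation such as "C.N.R." would be missed by this sketch. Gaps of this kind are one reason why the variant list needs to be validated by national and institutional experts.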
Technical features to consider:
For Higher Education Institutions, adopt the ID proposed by the ETER project, funded by DG Education and Culture in collaboration with DG Research and Eurostat.
A system of ID for the PROs and their hierarchical organization should be developed.
Compliance with existing standards related to the research system should be considered, by referring, for instance, to the following:
i. ORCID (http://orcid.org/) is a non-profit organization, supported by research organizations, agencies, providers of publication management systems, and publishers, aiming at giving all researchers a unique identifier (ORCID_id number) and keeping it persistent over time. Established at the end of 2009, but operational since end 2012, it has almost reached one million researchers worldwide. Most of the increase has been achieved in a very short time frame: from 100,000 in March 2013 to almost 970,000 as of October 2014 (with 35% from European, Middle East and Asian countries);
ii. CERIF is a Europe-based initiative aiming at standardizing the operations of funding agencies (http://www.eurocris.org);
iii. CASRAI (www.casrai.org) is a Canada-US initiative for the standardization of data on research institutions and funders (also supported by a committee of Science Europe; http://www.scienceeurope.org/scientific-committees/Life-sciences/life-sciences-committee);
iv. ISNI (www.isni.org) provides lists and metadata on higher education, research, funding and many other types of organizations, while Ringgold (www.ringgold.com) does the same in the world of publishers and intermediaries.
The Master list should be maintained at the official level by the Commission.
RÉSUMÉ
The European scientific system is formed by two large segments: universities and Public Research Organisations (PROs). These two segments represent, in fact, the bulk of scientific publications. The scientific output of universities is the object of a large effort in the specialised literature, as well as in various types of university rankings. The European Commission itself has supported the launch of the U-Multirank exercise. In addition, the geographic location of university activity is made easier by the fact that universities are generally "mono-plant" organisations, being located in a single city in about 95% of cases.
By contrast, little is known, on a systematic and comparable basis, about the scientific output of Public Research Organisations. More specifically, the following issues are still open:
establishing a comprehensive list of PROs;
disambiguating the innumerable ways in which PROs show up in bibliometric information, and normalising affiliations;
locating the scientific production of PROs at geographic level.
With respect to the list, it would be important to build up a repository. Disambiguation is also complex, since there are many ways in which the names of PROs appear. Finally, the geo-referencing of PRO publications is difficult.
Addressing these three issues will therefore require a dedicated effort.
The current study has been carried out as a proof of concept, and has critically examined the technical feasibility and validity of existing methods and approaches to address the open issues recalled above.
The assessment of the feasibility has been based on the following criteria:
Availability of data on publications. It is very important to clearly define the availability of data on publications in terms of sources, commercial conditions and regular updating.
Degree of automation of the procedures. Applying the algorithm described in this study in a full-scale exercise at European level does not present computational time problems.
Operational conditions needed. As discussed at length in the study, it is important to involve universities and the branches (institutions) of PROs in checking the list of variants of all institution names found in publications.
A summary of our recommendations follows.
Adopt an OBDM approach to ensure replicability, extension and updating of the system over time.
The availability of data on publications in terms of sources, commercial conditions and regular updating should be clearly discussed and settled. It is therefore of the utmost importance that the European Commission, informed by experts in the field, consider starting negotiations at an early stage with producers of large bibliographic databases about the conditions of use of these databases as sources in the creation and maintenance of the public information system at stake, and reach with one or more producers an appropriate licence agreement covering a sufficiently long time period.
Disambiguation is feasible, as illustrated in the study presented here. The main guidelines for its successful implementation can be summarised as follows.
General principles:
Use a validated thesaurus or authority file;
Use advanced disambiguation software;
Start from the raw affiliation data; do not use ill-understood features implemented by indexers;
Consult national and institutional experts; create an annotation file with background information on how an institution is defined;
Initiate further research into this issue, and examine the use of information from funding acknowledgements.
Technical features to consider:
For Higher Education Institutions, adopt the ID proposed by the ETER project, funded by DG Education and Culture in collaboration with DG Research and Eurostat.
A system of IDs for the PROs and their hierarchical organisation should be developed.
Compliance with existing standards related to the research system should be considered, by referring, for instance, to the following:
i. ORCID (http://orcid.org/) is a non-profit organisation, supported by research organisations, agencies, providers of publication management systems, and publishers, aiming at giving all researchers a unique identifier (ORCID_id number) and keeping it persistent over time. Established at the end of 2009, but operational since the end of 2012, it has almost reached one million researchers worldwide. Most of the increase has been achieved in a very short time frame: from 100,000 in March 2013 to almost 970,000 as of October 2014 (with 35% from European, Middle East and Asian countries);
ii. CERIF is a Europe-based initiative aiming at standardising the operations of funding agencies (http://www.eurocris.org);
iii. CASRAI (www.casrai.org) is a Canada-US initiative for the standardisation of data on research institutions and funders (also supported by a committee of Science Europe; http://www.scienceeurope.org/scientific-committees/Life-sciences/life-sciences-committee);
iv. ISNI (www.isni.org) provides lists and metadata on higher education, research, funding and many other types of organisations, while Ringgold (www.ringgold.com) does the same in the world of publishers and intermediaries.
The master list should be maintained at the official level by the Commission.
INTRODUCTION
This document reports on a study whose main goal is to address three problems: (1) establishing a comprehensive list of research organizations, in particular Public Research Organizations (PROs), (2) disambiguating the ways in which research organizations (in particular PROs) show up in bibliographic data, and (3) locating the scientific production of research organizations (in particular PROs) at geographic level.
This document is structured as follows. In Section 1 we present an approach (called Ontology-Based Data Management) to design a European Map of Excellence and Specialisation (EMES) that is sustainable over time. In Section 2 we briefly describe the most relevant modules of an ontology for EMES that we are currently developing at Sapienza University of Rome. In Section 3 we address the technical problems related to EMES.
In Section 4 we present an assessment of the feasibility of the activities described in this document, whereas in Section 5 we report our main recommendations for the design of a sustainable information system for EMES.
In the document, we will explicitly refer to the Technical Specification of the Tender, namely:
Creation of an Authority File Nomenclature of Units for Territorial Statistics (NUTS 2) in which all universities in EU 28 + Switzerland and Norway have a precise geographic location using GIS coordinates. This authority file will be provided on the basis of recent projects carried out on the topic (i.e. ETER and EUMIDA) and other research activities carried out at the Sapienza University of Rome.
Disambiguation and validation of affiliations of academic institutions used in publications indexed in bibliometric databases. This activity will be provided on the basis of recent projects carried out on the topic (i.e. ETER and EUMIDA) and other research activities carried out at the Sapienza University of Rome based on one bibliometric database.
Identification of public research institutions in affiliations used on publications indexed in bibliometric databases and assigning these to Nomenclature of Units for Territorial Statistics (NUTS2 and NUTS3) regions. Proposal of an approach for identifying PRI from affiliations of publications indexed in a bibliometric database and assigning to these the NUTS2 and NUTS3 regions. The approach will be tested on a sample of raw data extracted from a bibliometric database for a large PRO. A detailed description of the approach will be provided so that it can be replicated on a full scale.
An examination of the conditions under which the bibliometric databases underlying the EME are available for use in the EME project.
1. AN OBDM APPROACH TO DESIGN AN EMES SUSTAINABLE OVER TIME
The quantitative analysis of Science and Technology is becoming a big data science, with an increasing level of computerization, in which large and heterogeneous datasets on various aspects are combined. In this context, understanding and formally specifying the meaning of data is of paramount importance.1
Within this framework, optimistic views, supporting the "end of theory" in favour of data-driven science (Kitchin 2014), have been opposed to more critical positions in favour of theory-driven scientific discoveries (Frické 2014). It has been rightly highlighted that "Data are not simply addenda or second-order artifacts; rather, they are the heart of much of the narrative literature, the protean stuff that allows for inference, interpretation, theory building, innovation, and invention" (Cronin, 2013, p. 435). Moreover, the need for accountability of Science, Technology and Innovation (STI) activities, in order to sustain their funding in the current difficult economic and financial situation, increasingly calls for rigorous empirical evidence to support informed policy making.
The need to overcome the logic of rankings and the new trends in indicator development, including granularity and cross-referencing, can be explored and exploited in open data platforms with a clear description of the main concepts of the domain (Daraio & Bonaccorsi 2014). The multidimensionality of research assessment and scholarly impact (Moed & Halevi 2015), and the recent altmetrics movements (Cronin & Sugimoto 2014), are questioning the traditional approach to indicator development.
Research assessment, indeed, is becoming increasingly complex due to its multidimensional nature. A report published in 2010 by the Expert Group on the Assessment of University-Based Research, installed by the European Commission, proposed a consolidated multidimensional methodological approach addressing the various user needs, interests and purposes, and identifying data and indicator requirements (AUBR 2010, p. 10). A key notion holds that indicators designed to meet a particular objective or inform one target group may not be adequate for other purposes or target groups. Diverse institutional missions, and different policy environments and objectives, require different assessment processes and indicators. In addition, the range of people and organizations requiring information about university-based research is growing. Each group has specific but also overlapping requirements (AUBR 2010, p. 51).
Printed outputs (texts) | Non-printed outputs (non-text) | Main type of impact
------------------------------------------------------------------------------
Scientific journal paper; book chapter; scholarly monograph | Research data file; video of experiment; software | Scientific-scholarly
Patent; commissioned research report | New product or process; material; device; design; image; spin-off | Economic or technological
Professional guidelines; newspaper article; communication submitted to social media, including blogs, tweets | Interview; event; art performance; exhibit; artwork; scientific-scholarly advice | Social or cultural
Table 1: Main types of research outputs (source: Daraio, Lenzerini et al., 2015)
A research assessment has to take into account a range of different types of research output and impact. As regards output forms, one important distinction is between text-based and non-text-based output forms. The main types are presented in Table 1. This table is not fully comprehensive. The specifications of the Panel Criteria in the Research Excellence Framework in the UK (REF 2012, p. 51 ff.) provide more detailed lists of possible output forms arranged by major research discipline. Table 1 includes forms that are becoming increasingly important, such as research data files and communications submitted to social media and scholarly blogs.
1 This section is taken from Lenzerini (2015) and Daraio, Lenzerini et al. (2015), to which the reader is referred for the references therein.
A framework for the assessment of these forms is being developed in the field of altmetrics (e.g., Taylor 2013). The last column of Table 1 indicates the main types of impact a particular output may have. A distinction is made between scientific-scholarly impact and wider impact outside the domain of science and scholarship, denoted as societal, a concept that embraces technological, economic, social and cultural impact. A comprehensive overview of the types of impact, and of the most frequently used impact indicators, is presented in Table 2. The reader is referred to AUBR (2010) and Moed & Halevi (2015) for a further discussion of this table.
Type of impact | Short description; typical examples | Indicators (examples)
------------------------------------------------------------------------------
Scientific-scholarly or academic
Knowledge growth | Contribution to scientific-scholarly progress: creation of new scientific knowledge | Indicators based on publications and citations in peer-reviewed journals and books
Research networks | Integration in (inter)national scientific-scholarly networks and research teams | (Inter)national collaborations including co-authorships; participation in emerging topics
Publication outlets | Effectiveness of publication strategies; visibility and quality of used publication outlets | Journal impact factors and other journal metrics; diversity of used outlets
Societal
Social | Stimulating new approaches to social issues; informing public debate and improving policymaking; informing practitioners and improving professional practices; providing external users with useful knowledge; improving people's health and quality of life; improvements in environment and lifestyle | Citations in medical guidelines or policy documents to research articles; funding received from end-users; end-user esteem (e.g., appointments in (inter)national organizations, advisory committees); juried selection of artworks for exhibitions; mentions of research work in social media
Technological | Creation of new technologies (products and services) or enhancement of existing ones based on scientific research | Citations in patents to the scientific literature (journal articles)
Economic | Improved productivity; adding to economic growth and wealth creation; enhancing the skills base; increased innovation capability and global competitiveness; uptake of recycling techniques | Revenues created from the commercialization of research-generated intellectual property (IP); number of patents, licenses, spin-offs; number of PhD and equivalent research doctorates; employability of PhD graduates
Cultural | Supporting greater understanding of where we have come from, and who and what we are; bringing new ideas and new modes of experience to the nation | Media (e.g. TV) performances; essays on scientific achievements in newspapers and weeklies; mentions of research work in social media
Table 2: Types of Research Impact and Indicators (source: Daraio, Lenzerini et al. 2015)
It is also important to include the inputs in the analysis; they should be jointly analysed with the outputs to assess the overall impact of the process (see e.g. Daraio et al. 2014, for a conditional multidimensional approach to rank higher education institutions).
To meet all these new trends and policy needs, a shift in the paradigm of data integration for research assessment is needed. In this paper we advocate an OBDM approach to research assessment.
1.1 Difficulties in accessing and managing distributed and heterogeneous data
While the amount of data stored in current information systems and the processes making use of such data continuously grow, turning these data into information, and governing both data and processes, are still tremendously challenging tasks for Information Technology. The problem is complicated by the proliferation of data sources and services both within a single organization and in cooperating environments. The following factors explain why such a proliferation constitutes a major problem with respect to the goal of carrying out effective data governance tasks:
Although the initial design of a collection of data sources and services might be adequate, corrective maintenance actions tend to re-shape them into a form that often diverges from the original conceptual structure.
It is common practice to change a data source (e.g., a database) so as to adapt it both to specific application-dependent needs, and to new requirements. The result is that data sources often become data structures coupled to a specific application (or a class of applications), rather than application-independent databases.
The data stored in different sources and the processes operating over them tend to be redundant and mutually inconsistent, mainly because of the lack of central, coherent and unified coordination of data management tasks.
The result is that information systems of medium and large organizations are typically structured according to a silos-based architecture, constituted by several independent and distributed data sources, each one serving a specific application. This poses great difficulties with respect to the goal of accessing data in a unified and coherent way. Analogously, processes relevant to the organizations are often hidden in software applications, and a formal, up-to-date description of what they do on the data and how they are related to other processes is often missing. The introduction of service-oriented architectures is not a solution to this problem per se, because the fact that data and processes are packed into services is not sufficient for making the meaning of data and processes explicit. Indeed, services become other artifacts to document and maintain, adding complexity to the governance problem. Analogously, data warehousing techniques, and the separation they advocate between the management of data for the operation level and data for the decision level, do not provide solutions to this challenge.
On the contrary, they also add complexity to the system, by replicating data in different layers of the system and introducing synchronization processes across layers. All the above observations show that unified access to data and effective governance of processes and services are extremely difficult goals to achieve in modern information systems. Yet, both are crucial objectives for getting useful information out of the information system, as well as for taking decisions based on it.
This explains why organizations spend a great deal of time and money on the understanding, the governance, the management, and the integration of data stored in different sources, and of the processes/services that operate on them, and why this problem is often cited as a key and costly Information Technology challenge faced by medium and large organizations today (Bernstein & Haas, 2008). We argue that ontology-based data management (OBDM, Lenzerini 2011) is a promising direction for addressing the above challenges.
1.2 What is OBDM
The key idea of OBDM is to resort to a three-level architecture, constituted by the ontology, the
sources, and the mapping between the two. The ontology is a conceptual, formal description of the domain of interest to a given organization (or, a community of users), expressed in terms of
relevant concepts, attributes of concepts, relationships between concepts, and logical assertions characterizing the domain knowledge. The data sources are the repositories accessible by the organization where data concerning the domain are stored. In the general case, such repositories are numerous, heterogeneous, each one managed and maintained independently
from the others. The mapping is a precise specification of the correspondence between the data contained in the data sources and the elements of the ontology.
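The three levels described above can be illustrated with a minimal Python sketch. It is not the architecture of any actual OBDM system: the two sources, their field names, the identifiers (p1, org_a, ...) and the mapping functions are all hypothetical, and the ontology is reduced to a bare vocabulary of predicates.

```python
# Level 1: the ontology, reduced here to a vocabulary of predicates (assumption).
ONTOLOGY = {"Researcher", "Organization", "has_affiliation"}

# Level 2: two independent, heterogeneous sources (hypothetical data).
staff_table = [{"emp_id": 7, "person": "p1", "dept": "org_a"}]
grants_file = [("p2", "org_b", "2014")]


# Level 3: the mapping, one function per source, producing ontology-level facts.
def map_staff(rows):
    for row in rows:
        yield ("Researcher", row["person"])
        yield ("Organization", row["dept"])
        yield ("has_affiliation", row["person"], row["dept"])


def map_grants(rows):
    for person, org, _year in rows:
        yield ("Researcher", person)
        yield ("Organization", org)
        yield ("has_affiliation", person, org)


def virtual_facts():
    """Facts visible at the ontology level; the sources stay where they are."""
    facts = set(map_staff(staff_table)) | set(map_grants(grants_file))
    # Every fact must use a predicate declared in the ontology.
    assert all(fact[0] in ONTOLOGY for fact in facts)
    return facts


def query(predicate):
    """Answer a query phrased with an ontology predicate, not a source schema."""
    return sorted(fact[1:] for fact in virtual_facts() if fact[0] == predicate)


print(query("has_affiliation"))
```

The consumer only names has_affiliation; which source each answer comes from, and how that source is structured, is hidden behind the mapping.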
The main purpose of an OBDM system is to allow information consumers to query the data using the elements in the ontology as predicates. In this sense, OBDM can be seen as a form of information integration, where the usual global scheme is replaced by the conceptual model of the application domain, formulated as an ontology expressed in a logic-based language. With this approach, the integrated view that the system provides to information consumers is not
merely a data structure accommodating the various data at the sources, but a semantically rich description of the relevant concepts in the domain of interest, as well as the relationships between such concepts. The distinction between the ontology and the data sources reflects the separation between the conceptual level, the one presented to the client, and the logical/physical level of the information system, the one stored in the sources, with the mapping
acting as the reconciling structure between the two levels. This separation brings several potential advantages:
The ontology layer in the architecture is the obvious means for pursuing a declarative approach to information integration, and, more generally, to data governance. By making the representation of the domain explicit, we gain re-usability of the acquired knowledge, which is not achieved when the global schema is simply a unified description of the underlying data sources.
The mapping layer explicitly specifies the relationships between the domain concepts on the
one hand and the data sources on the other hand. Such a mapping is not only used for the operation of the information system, but also for documentation purposes. The importance of this aspect clearly emerges when looking at large organisations where the information about data is scattered across separate pieces of documentation that are often difficult to access and rarely conform to common standards. The ontology and the corresponding mappings to the data sources provide a common ground for the documentation of all the data in the organisation, with obvious advantages for the governance and the management of the
information system.
A third advantage has to do with the extensibility of the system. One criticism that is often raised against data integration is that it requires merging and integrating the source data in advance, and this merging process can be very costly. However, the ontology-based approach we advocate does not require fully integrating the data sources at once. Rather, after building even a rough skeleton of the domain model, one can incrementally add new data sources or new elements therein, when they become available, or when needed, thus
amortising the cost of integration. Therefore, the overall design can be regarded as the incremental process of understanding and representing the domain, the available data sources, and the relationships between them. The goal is to support the evolution of both the ontology and the mappings in such a way that the system continues to operate while evolving, along the lines of "pay-as-you-go" data integration.
The notions of OBDM were introduced in (Calvanese et al. 2007; Poggi et al. 2008; Lenzerini
2011), and originated from several disciplines, in particular, Information Integration, Knowledge Representation and Reasoning, and Incomplete and Deductive Databases. The central notion of OBDM is therefore the ontology, and reasoning over the ontology is at the basis of all the tasks that an OBDM system has to carry out. In particular, the axioms of the ontology allow one to
derive new facts from the source data, and these inferred facts greatly influence the set of answers that the system should compute during query processing. In the last decades, research on ontology languages and ontology inferencing has been very active in the area of Knowledge
Representation and Reasoning. Description Logics (DLs, Baader et al. 2007) are widely recognized as appropriate logics for expressing ontologies, and are at the basis of the W3C standard ontology language OWL. These logics permit the specification of a domain by providing the definition of classes and by structuring the knowledge about the classes using a rich set of logical operators. They are decidable fragments of mathematical logic, resulting from extensive investigations on the trade-off between expressive power of Knowledge Representation languages, and computational complexity of reasoning tasks. Indeed, the constructs appearing
in the DLs used in OBDM are carefully chosen taking into account such a trade-off (Calvanese et al. 2007). As indicated above, the axioms in the ontology can be seen as semantic rules that are used to complete the knowledge given by the raw facts determined by the data in the sources. In this sense, the source data of an OBDM system can be seen as an incomplete database, and query answering can be seen as the process of computing the answers logically deriving from
the combination of such incomplete knowledge and the ontology axioms. Therefore, at least conceptually, there is a connection between OBDM and the two areas of incomplete information
and deductive databases (Ceri et al. 1990).
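The effect of ontology axioms on query answers can be illustrated with a small Python sketch. It is only a toy: it handles nothing but inclusion axioms of the form "every A is a B" by naive saturation, whereas actual OBDM systems rely on Description Logic reasoning and typically avoid materialising inferred facts. The axioms are a fragment suggested by the Agent module (a legal entity is an organization, organizations and natural persons are subjects); the individuals o1 and p1 are hypothetical.

```python
# Inclusion axioms "every A is a B" (single parent per concept, an assumption).
SUBCLASS_OF = {
    "Legal_entity": "Organization",
    "Organization": "Subject",
    "Natural_person": "Subject",
}

# Raw facts as they would come out of the sources through the mapping.
source_facts = {("Legal_entity", "o1"), ("Natural_person", "p1")}


def saturate(facts, axioms):
    """Add every fact that follows from the inclusion axioms."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for concept, individual in list(derived):
            parent = axioms.get(concept)
            if parent is not None and (parent, individual) not in derived:
                derived.add((parent, individual))
                changed = True
    return derived


def answers(concept, facts, axioms):
    """Individuals that are instances of `concept` given facts plus axioms."""
    return sorted(ind for c, ind in saturate(facts, axioms) if c == concept)


print(answers("Subject", source_facts, {}))           # sources alone
print(answers("Subject", source_facts, SUBCLASS_OF))  # sources plus ontology
```

No source states that o1 or p1 is a Subject, so the first query returns nothing; the second returns both, because the answers are those that logically follow from the incomplete data together with the axioms.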
2. FUNDAMENTAL MODULES OF AN ONTOLOGY FOR EMES
In this section we describe a first draft of an ontology for EMES, that we call Sapientia (here, we refer to the beta version of Sapientia 1.0). This first draft should be seen as a first step towards
a complete ontology for research assessment, and is constituted by four modules, each one centered around one of the following concepts: Agent, Research, Publishing, and Space.
We observe that Sapientia is under development at the "Dipartimento di Ingegneria Informatica Automatica e Gestionale" of Sapienza Università di Roma. What we illustrate here is a first, incomplete draft whose aim is to provide some evidence of the fact that an ontology for research assessment can be built, and to show some of the benefits that such an ontology can
provide. The draft is incomplete for two main reasons:
not all the relevant concepts are included in the modules that we present, and not all relevant characteristics of the concepts are included; this is due both to the lack of space, and to the fact that our ontology is not yet complete;
the properties of the concepts that we model here are only static properties. In the full version of the ontology, also dynamic properties (those varying in time) are modelled, by introducing suitable time-dependent relations associated to the concepts.
The modules are specified in OWL 2, the standard logic-based ontology language promoted by the W3C. In order to provide an intuitive account of the various modules, we describe them in terms of a diagrammatic representation.
The diagrammatic language for expressing the ontology is Graphol, designed at the "Dipartimento di Ingegneria Informatica Automatica e Gestionale" of Sapienza Università di Roma. A tutorial on this language is available at
http://www.dis.uniroma1.it/~graphol/documentation/GrapholIntro.pdf.
The modules expressed in Graphol are available on request.
In what follows, we devote one subsection to each of the above mentioned modules. After such subsections, we also present a view of the ontology that is particularly relevant for the remaining sections of this document. The view collects various concepts that are present in the four modules, and presents them in a single diagram, with the goal of highlighting the relevant relations. The main concept characterizing the view is Affiliation, and for this reason we simply
call the view "Author Affiliation View".
2.1 The module Agent
This module concerns the agents that are relevant in our domain. An agent is any subject operating in the research world, and this module aims at describing and classifying the various types of agents. More precisely, an agent is a role that a subject assumes when carrying on activities related to research, where a subject is either a natural person or an organization.
Thus, there are four basic concepts to deal with in this module, namely, Subject, Agent, Natural person, and Organization.
SUBJECT
A subject is any entity in the research domain which can act as an agent and, playing such a role, performs relevant activities in the research context. As we said, there are two types of subjects: natural persons (see below), and organizations (see below).
AGENT
Any subject during his/her/its life can embody one or more agents, each of them operating during a given period of time. Thus, an agent is a role a subject assumes when carrying on relevant activities in the research domain. Notice that a subject can embody more than one agent at the same time, but there can be periods when a subject exists without embodying any
agent.
The ontology considers the following types of agent. Each of them is described in detail in a
specific module of the overall ontology.
Table 3. Ontology of agents
Types of agent | Description | Sapientia module
Author | a subject which has written some content to be published as a publication (for instance reporting the results of a research he/she has carried out) | Publishing
Caretaker | a subject which takes care of things valuable in the research world | Caretaker
Degrees conferrer | a subject which grants degrees allowing students to qualify themselves | Degrees conferrer
Editor | a subject which oversees and coordinates a publication where contributions of different authors need to be verified, harmonized and combined | Publishing
Examiner | a subject which uses his/her/its competence to assess something | Examiner
Producer | a subject which produces economic value | Producer
Publisher | a subject which provides some media to deliver and display publications | Publishing
Researcher | a subject which attempts to advance the state of the art of knowledge | Researcher
Student | a natural person who acquires knowledge to improve his/her educational qualification | Teacher
Supporter | a subject which assigns or distributes funds to other agents | Supporter
Teacher | a natural person who teaches some students | Teacher
Source: Sapientia 1.1 DIAG Sapienza University of Rome.
Note that any subject which assumes the role of student or the role of teacher must be a natural person. An agent of specific types (e.g., researchers, teachers, etc.) can be affiliated to one or more organizations, where any affiliation is an instance of the concept Affiliation and has
specific properties (its duration, for example). Affiliation is a time-dependent concept.
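The distinction between subjects, the agents they embody over time, and affiliations can be sketched in Python as follows. This is only an illustration, not the Sapientia specification: the class and field names, the use of calendar years as time granularity, and the individuals p1 and o1 are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Roles that, according to the text, only a natural person can assume.
PERSON_ONLY_ROLES = {"Student", "Teacher"}


@dataclass(frozen=True)
class Subject:
    name: str
    kind: str  # "natural_person" or "organization"


@dataclass(frozen=True)
class Agent:
    """A role embodied by a subject during a period (years are an assumption)."""
    subject: Subject
    role: str
    start_year: int
    end_year: Optional[int] = None  # None means the role is still held

    def __post_init__(self):
        if self.role in PERSON_ONLY_ROLES and self.subject.kind != "natural_person":
            raise ValueError(f"{self.role} must be embodied by a natural person")

    def active_in(self, year):
        return self.start_year <= year and (self.end_year is None or year <= self.end_year)


@dataclass(frozen=True)
class Affiliation:
    """Time-dependent link between an agent and an organization."""
    agent: Agent
    organization: Subject
    start_year: int
    end_year: Optional[int] = None


def roles_in(subject, agents, year):
    """Roles the subject embodies in a given year (possibly none, possibly several)."""
    return sorted(a.role for a in agents if a.subject == subject and a.active_in(year))


p1 = Subject("p1", "natural_person")
o1 = Subject("o1", "organization")
agents = [Agent(p1, "Student", 2005, 2010), Agent(p1, "Researcher", 2008)]
affiliation = Affiliation(agents[1], o1, 2008)
print(roles_in(p1, agents, 2009))
```

The same subject embodies two agents in 2009 and none in 2000, which mirrors the remark that a subject can exist without embodying any agent.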
NATURAL PERSON
A natural person is a specific type of subject, who can assume many roles in the research domain, thus embodying a number of agents (the most common are: author, editor, examiner, researcher, student and teacher). Furthermore, a natural person can act in affiliations with one or more organizations, in particular when some of the agents she embodies are
involved in that affiliation.
Any natural person has, at any moment of life, a socio-cultural-stage which is the formalized level of her ability to contribute in the research context. The socio-cultural-stage has two components: the educational qualification, and the career position.
An educational qualification is a degree that a natural person has achieved. Similarly, a career position is a career grade that a natural person has reached (or that she is habilitated to reach). Notice that Educational_qualification and Career position are two concepts representing a
relationship in which a natural person is involved (the former with a degree, the latter with a career grade). The more socio-cultural-stages a person has reached, the better she can contribute to research (and the more significant her responsibility is). The sequence of career positions of a person is her career.
Considering the possible socio-cultural-stages, these are defined by two concepts: Degree and Career grade. Any instance of these concepts must have a national recognition in order to be accepted in a given nation. A Recognition is an act a Nation performs when formalizing the validity of a degree or a career grade. Therefore, we have degree recognitions, granting the validity of a degree, and grade recognitions, granting the validity of a career grade. Note that a
Grade_recognition can have a Degree_recognition as a prerequisite, and recognitions can be related to one another (Inter_recognition). There are two kinds of inter-recognition:
Eq_recognition. Any of the required degrees is considered equivalent to the recognized degree: having one of them is the same as having the recognized degree.
Access_recognition. None of the required degrees is equivalent to the recognized degree, which still has to be achieved; having one of them, however, makes the achievement possible.
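The difference between the two kinds of inter-recognition can be made concrete with a short Python sketch. The degree names (Degree_A, Degree_B, ...) and the encoding of a recognition as a record with a set of required degrees are assumptions made for illustration; equivalence is applied in one step only and is not chained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterRecognition:
    kind: str            # "eq" for Eq_recognition, "access" for Access_recognition
    recognized: str      # the recognized degree
    required: frozenset  # degrees, any one of which satisfies the recognition


def holds(degree, held, recognitions):
    """True if the person holds the degree directly or through an Eq_recognition."""
    if degree in held:
        return True
    return any(
        r.kind == "eq" and r.recognized == degree and bool(r.required & held)
        for r in recognitions
    )


def can_pursue(degree, held, recognitions):
    """True if an Access_recognition makes achieving the degree possible."""
    return any(
        r.kind == "access"
        and r.recognized == degree
        and any(holds(d, held, recognitions) for d in r.required)
        for r in recognitions
    )


recognitions = [
    InterRecognition("eq", "Degree_A", frozenset({"Degree_B"})),
    InterRecognition("access", "Degree_C", frozenset({"Degree_A"})),
]
held = {"Degree_B"}
print(holds("Degree_A", held, recognitions), can_pursue("Degree_C", held, recognitions))
```

Holding Degree_B counts as holding Degree_A, which in turn opens access to Degree_C, but Degree_C itself still has to be achieved.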
ORGANIZATION
An organization is a type of subject that can assume many roles in the research domain, embodying a number of agents (the most common are: degrees_conferrer, researcher, publisher, supporter, producer and caretaker).
An organization may be a legal entity, a subordinated organization or both. A subordinated organization is under the control of its parent organization, which delegates to the subordinated
organization part of its goals. Similarly, an affiliation is under the control of its parent organization, which delegates to the affiliated natural person part of its goals.
Among the legal entities, of particular concern are public administrations, enterprises, and technical and research institutions. Notice that universities are technical and research institutions that play the role of degrees conferrer agents, and public research institutes are both public administrations and technical and research institutions.
Table 4. Short definitions
term | type | definition
Affiliation | Concept | the condition of an agent when affiliated to an organization
has_degree_recognition | Relation | links a degree to its recognitions
has_parent | Relation | links every affiliation or subordinated organization to its parent organization
acts_in | Relation | links any person to her affiliations
has_recognizing_nation | Relation | links a recognition to the nations which grant it
Legal_entity | Concept | an organization which the law allows to act as if it were a single person for certain purposes
Organization | Concept | any social unit of people, structured and managed to meet a need or to pursue collective goals. Organizations are open systems: they affect and are affected by their environment.
Recognition | Concept | any grant a nation gives to a Degree or a Career grade
Subject | Concept | any entity which can act as an agent and, playing such role, can perform some activities relevant in the research world
Source: Sapientia 1.1 DIAG Sapienza University of Rome.
2.2 The module Research
The research module aims at modelling the research activities of researchers and their products. The central concept of the module is Research activity, linked to three crucial concepts
representing the research products: Research output, Research outcome, Technology transfer.
RESEARCH OUTPUTS AND OUTCOMES
Any research activity has:
its direct output (has_output), available without the contributions of any other activities,
its outcome (has_outcome): any output of any activity (not necessarily a research activity) participating in a value chain where the research activity has an enabling role (i.e., without the
research activity that output would not be generated).
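The difference between has_output and has_outcome can be sketched as a reachability computation over a value chain. The sketch below rests on assumptions that are not part of the ontology: the enabling role is approximated by a hypothetical NEEDS relation (an activity needs some outputs of other activities), direct outputs are not counted among the outcomes, and all activity and output names are invented.

```python
# has_output: activity -> outputs it produces directly (hypothetical data).
HAS_OUTPUT = {
    "research_R": {"method_M"},
    "engineering_E": {"prototype_P"},
    "production_F": {"product_Q"},
    "unrelated_U": {"report_Z"},
}

# Assumed encoding of the enabling role: activity -> outputs it cannot do without.
NEEDS = {
    "engineering_E": {"method_M"},
    "production_F": {"prototype_P"},
}


def outcomes(activity):
    """Outputs of downstream activities that would not exist without `activity`."""
    enabled_outputs = set()
    seen_activities = {activity}
    frontier = set(HAS_OUTPUT.get(activity, set()))
    while frontier:
        next_frontier = set()
        for other, needed in NEEDS.items():
            if other not in seen_activities and needed & frontier:
                seen_activities.add(other)
                produced = HAS_OUTPUT.get(other, set())
                enabled_outputs |= produced
                next_frontier |= produced
        frontier = next_frontier
    return enabled_outputs


print(sorted(HAS_OUTPUT["research_R"]))  # direct output
print(sorted(outcomes("research_R")))    # outcome along the value chain
```

The research activity has a single direct output, the method, while its outcome collects the prototype and the product generated further down the chain.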
The figure shows the chain schemes that justify the outcome of the research activity shown on the left. The arrows represent the relation has_output. There are three particular kinds of
research output considered as remarkable cases in the ontology:
method: the specification of means and modes;
abstract artefact: a role which can be played by an object and can be effectively recognized
Figure 1: The model Research. Source: Sapientia 1.1 DIAG Sapienza University of Rome.
The following table shows a list of research products considered in many papers about
research assessment, giving the correspondence with an ontological specialization of Research_output and Research_outcome.
analyse | Unreviewed_document | Unreviewed_document
archive | Software_artefact | Software_artefact
artefact | Physical_artefact | Physical_artefact
assessment material | Unreviewed_document | Unreviewed_document
book chapter | Book-like_content | Chapter
books | Book-like_content | Monograph
building | Physical_artefact | Physical_artefact
case note | Unreviewed_document | Unreviewed_document
catalogue | Software_artefact | Software_artefact
chapter | Book-like_content | Chapter
code | Unreviewed_document | Unreviewed_document
Composition | Event_class | Event_class
conference paper | Paper-like_content | Paper
Confidential_report | Report | Report
Conservation | - | -
Creative writing | Physical_artefact | Physical_artefact
critical review article | Survey | Paper
data set | Software_artefact | Software_artefact
Database | Software_artefact | Software_artefact
design | Unreviewed_document | Unreviewed_document
design code | Unreviewed_document | Unreviewed_document
device | Product | Product
dictionary | Dictionary | Dictionary
digital broadcast media | Media_content | Media_content
electronic publication | Paper-like_content | Paper
electronic resource | Software_artefact | Software_artefact
evidence synthesis | Unreviewed_document | Unreviewed_document
exhibition | - | -
film | Film | Film
grammar | Written_content | Written_content
image | Image | Image
installation | Event | Event
intellectual property | Patent-like_content | Patent
Interactive tool | Physical_object | Physical_object
Journal articles (see item 12 of the to-do list) | Paper-like_content | Paper
map | Map | Map
material | Material | Material
meta-analyses | Unreviewed_document | Unreviewed_document
meta-syntheses | Unreviewed_document | Unreviewed_document
methodological and theoretical work | Content | Content
molecule | Material | Material
monograph | Book-like_content | Monograph
multi-use data set | Software_artefact | Software_artefact
multilateral and international agencies research reports | Report | Report
Museum catalogue | Software_artefact | Software_artefact
outputs from projects commissioned by all levels of government, industry and other research funding bodies | Report | Report
paper in conference proceeding | Paper-like_content | Paper
paper in peer-reviewed journal | Paper-like_content | Paper
patent | Patent-like_content | Patent
patent application | Patent-like_content | Patent_application
performance | - | -
policy evaluation/reports commissioned report | Report | Report
primary data reports | Report | Report
Process | Method | Method
Product | Product | Product
prototype | Prototype | Prototype
publication of development donors | Unreviewed_document | Unreviewed_document
published conference paper | Paper-like_content | Paper
research report | Report | Report
research reports to government departments, charities, the voluntary sector, professional bodies, industry or commerce | Report | Report
research-based case studies | Descriptive_content | Descriptive_content
review articles | Survey | Paper
service | Event_class | Event_class
software package | Software_artefact | Software_artefact
Special issue | Paper-like_content | Paper
Standard | Method | Method
systematic review | Unreviewed_document | Unreviewed_document
technology appraisal | Report | Report
text books | Book-like_content | Monograph
textbook | Book-like_content | Monograph
therapy | Method | Method
Translation | Unreviewed_document | Unreviewed_document
video | Film | Film
Web content | Software_artefact | Software_artefact
work published in non print media | Paper-like_content | Paper
working paper | Unreviewed_document | Unreviewed_document
Source: Sapientia 1.1 - DIAG, Sapienza University of Rome.
TECHNOLOGICAL TRANSFERS
A technology transfer is a particular type of research activity which generates, as its output, something more than knowledge (different from descriptive content). The notion of technology
transfer allows applied research to be considered as compositions of research activities (not necessarily temporally ordered): each composition has one activity producing knowledge and one producing application results (the technology transfer).
2.3 The module Publishing
This module concerns publishing, the activity that allows people to know the results of research. The output of a publishing activity is a publication, which is a way to represent a content through some media. Notice that a publication of a research work is not considered as output of
that work (the output of that work is the content of the publication). The central concepts of this module are: Publication, Content, and Reference.
PUBLICATION
A publication aims at reporting (at any level) empirical or theoretical work and describes the
results in some knowledge field. There are three kinds of agents involved in a publication:
Author: an author of a publication is an agent which has contributed in writing the content of the publication (for instance reporting the results of a research she has carried out);
Editor: an editor of a complex publication (where contributions of different authors need to be verified, harmonized and combined) is an agent which oversees and coordinates the publication;
Publisher: a publisher of a publication is the agent which provides some media to deliver and display a publication (the ontology considers as a traceable activity only the publisher activity).
There are three kinds of publications that are relevant in the research domain:
Atomic publications: a publication resulting from a unique, indivisible act of writing by one or more authors.
Collections: a publication disseminating a group of atomic publications in a unique impulse, during a limited and short period of time.
Series, each disseminating a group of atomic publications during a long and (perhaps) unlimited period of time. Notice that a series can disseminate its atomic publications in a
direct way or by disseminating collections.
Figure 2: The model Publishing (Source: Sapientia 1.1 DIAG Sapienza University of Rome.)
The figure above shows how our ontology considers the cycle of production and dissemination of knowledge.
Starting from the research box (right-middle of the figure), people carry out their research activities and produce content to be disseminated.
An author (typically) publishes his/her atomic publication (for example a paper).
This atomic publication is disseminated: it is transmitted in a collection or in a series.
People read and study it and use the knowledge represented in it as a starting point for new
research.
A patent application is a possible publication, output of an applied research. Notice that a patent is different from the other types of atomic publication: it is a right granted by a state which may concern a research output, not an output itself. A patent application follows its own path within the three levels of publications:
it is an atomic publication itself,
it is published in an issue (a collection),
that issue appears in an Intellectual property law journal (a series)
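The three levels of publications, and the path followed by a patent application through them, can be sketched with a few Python classes. The class names follow the concepts in the text, but the fields, the helper function and the titles are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class AtomicPublication:
    title: str


@dataclass
class Collection:
    """Disseminates a group of atomic publications in a single impulse."""
    title: str
    items: List[AtomicPublication] = field(default_factory=list)


@dataclass
class Series:
    """Disseminates atomic publications directly or through collections."""
    title: str
    items: List[Union[AtomicPublication, Collection]] = field(default_factory=list)


def atomic_titles(series):
    """Titles of all atomic publications a series disseminates, in order."""
    titles = []
    for item in series.items:
        if isinstance(item, Collection):
            titles.extend(publication.title for publication in item.items)
        else:
            titles.append(item.title)
    return titles


application = AtomicPublication("patent application X")
issue = Collection("issue 1", [application])
journal = Series("intellectual property journal J", [issue, AtomicPublication("paper Y")])
print(atomic_titles(journal))
```

The patent application reaches the series through an issue, while the second publication is disseminated by the series directly; both cases are allowed by the model.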
CONTENT
There are three kinds of contents:
Paper-like content (content structured for publication as a paper)
Book-like content (content structured for publication as monographs or edited chapters)
Patent-like content (content structured for publication as patent applications - see above)
Note that there are no constraints between contents and the publications where they can be published (for example a patent_like_content can be placed in a part of a paper).
REFERENCES
Any new atomic publication provides references to previously published communications which have a bearing on the subject of the new publication. The purpose of the references is to allow
readers of the paper to refer to cited work assisting them in judging the new work, giving source background information, and acknowledging the contributions of earlier researchers.
Every atomic publication is conceptually divided in two parts: (i) the body of the new content, and (ii) the references to previous publications.
Any reference in the reference part of an article is represented in the ontology as an instance of the concept Reference (in the figure it is represented by a little red circle, whereas the
publication is represented by a little blue half-circle).
The participation of a reference R in a (atomic) publication A is represented by the role has_as_reference from A to R.
The (unique) publication B cited by R is represented by the role has_citation_from from B to R.
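The two roles around Reference can be illustrated with a small Python sketch that derives which publications cite a given one. The identifiers (A1, r1, B1, ...) are hypothetical. The role has_citation_from goes from the cited publication B to the reference R; it is stored here keyed by reference, which is an implementation choice that enforces the uniqueness of the cited publication.

```python
# has_as_reference: (citing atomic publication A, reference R) pairs.
HAS_AS_REFERENCE = [("A1", "r1"), ("A1", "r2"), ("A2", "r3")]

# has_citation_from (B to R), stored as reference -> the unique cited publication.
CITED_PUBLICATION_OF = {"r1": "B1", "r2": "B2", "r3": "B1"}


def cited_by(publication):
    """Atomic publications having at least one reference that cites `publication`."""
    return sorted(
        {
            citing
            for citing, reference in HAS_AS_REFERENCE
            if CITED_PUBLICATION_OF.get(reference) == publication
        }
    )


def citation_count(publication):
    """Number of Reference instances pointing to `publication`."""
    return sum(1 for cited in CITED_PUBLICATION_OF.values() if cited == publication)


print(cited_by("B1"), citation_count("B1"))
```

Keeping Reference as an object of its own, rather than a direct link from A to B, leaves room for attaching properties to each individual citation.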
2.4 The module Space
The module Space aims at modelling the various space regions of interest in the research world. Figure 3 shows a diagrammatic representation of the module. We distinguish between spaces (regions of spaces), and places, where a place represents a usage of a particular space made by people. Correspondingly, the module is organized in two main submodules:
the space submodule
the place submodule
SPACE
The space submodule models essentially two notions: region of space, and point on earth.
Notice that a region of space is not necessarily on the Earth's surface (it can be, for example, under the Earth's surface or on the surface of the Moon).
A region of space may be a part of another region of space and a point on Earth may be located in a region of space. If a point is included in a region which is part of another region, that point is also included in the latter region. Every point on Earth has its coordinates: longitude, latitude, elevation.
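The inclusion rule stated above (a point located in a region is also included in every region that contains that region) can be sketched in Python as follows. The region and point names are hypothetical, and the sketch assumes that each region is part of at most one other region and that the part-of relation has no cycles.

```python
# part-of between regions: region -> the region it is part of (single parent assumed).
PART_OF = {"region_c": "region_b", "region_b": "region_a"}

# location of points: point -> the innermost region it is located in.
LOCATED_IN = {"point_p": "region_c"}


def enclosing_regions(point):
    """All regions that include the point, from the innermost to the outermost."""
    regions = []
    region = LOCATED_IN.get(point)
    while region is not None:
        regions.append(region)
        region = PART_OF.get(region)
    return regions


def is_included(point, region):
    return region in enclosing_regions(point)


print(enclosing_regions("point_p"))
```

Only the innermost location is stored; inclusion in the outer regions is derived, which is the kind of inferred fact the ontology axioms are meant to provide.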
PLACE
The conceptual organization of space, in our ontology, is based on the notion of place. A place is a role assigned to a region of space in recognizing an interest about the area itself. While a spatial area exists independently from human intervention, a place is a social object: it exists through a decision which confers to the region a needed role. There are four kinds of places in the ontology:
relevant place: the place is a potential source of investigation in research activities (e.g. the moon, the Iguazu falls, the ancient Naxos)
territory: the place is a political and economic unit where people live and produce (e.g. Italy, Normandy).
site: the place is the location of something of interest (a research source, an activity or an asset) from the point of view of research (e.g. the site of the Large Hadron Collider of the CERN laboratory).
In cases 1 and 2 the role is directly attributed to the region of space: whatever is permanently placed within the region inherits the role, because it is in that region. In cases 3 and 4 the role is indirectly attributed to the region of space through the interesting things it includes. So the region of space:
must be large enough to include all the objects characterizing the role;
must be small enough to characterize the place where they all are: there should be points in the region (entrances) which are to be considered, in terms of time and cost, equivalent for
reaching those objects, the effort to cover the distance between the points and the objects being negligible.
Any object in a Site has, at a given time, its position in the site. The position in a site
(accessible through an entrance) characterizes the relation between an object and a site in a given period. A residence is a particular position for organizations (e.g. the residence of the CERN research organization); see the agent module.
While the concept of place is static, the concept of position is dynamic. Consider the following example.
Example
The figure shows the following facts:
the asset A1 changes its position from position 1 (in the place P1) to position 2 (in the place P2)
the asset A2 assumes the position 3 (in the place P1) after A1 has moved to position 2, leaving P1 free.
Figure 3: The module Space (Source: Sapientia 1.1 DIAG Sapienza University of Rome)
TERRITORY
A territory is a place where a certain administrative policy is applied. Note that the population and gross domestic product are characteristics of the site and not of the region of space
filled by the place. The link between a person who inhabits a place and the place itself is conventional and does not depend, in general, on the actual position of that person at a given time. The same applies to economic activities.
Since a territory is a role of a region of space, it is thought of as including all social objects and social activities that are dependent on the role. For example a Nation is considered both as the country (a place in political geography) and the state which has sovereignty over the country. Similarly, a cluster system is a territory used as a field for analyzing the economic, social, political
and institutional relationships that generate, within the boundary of the cluster itself, a collective learning process in a group of technological or functional areas.
As for European territories, the Nomenclature of Territorial Units for Statistics (NUTS) was drawn up by Eurostat more than 30 years ago in order to provide a single uniform breakdown of territorial units for the production of regional statistics for the European Union. Starting from 2003 it has been adopted by the European Parliament. A particularly important goal of the
Regulation is to minimize the impact of changes in the national administrative structures on the
availability of comparable regional statistics.
NUTS is a hierarchical system for dividing up the economic territory of the EU for the purpose of collecting, developing and harmonizing European regional statistics about:
major socio-economic European regions,
basic regions with respect to the application of regional policies,
small regions for specific diagnoses.
Different criteria have been used in subdividing national territories into regions. These are normally divided into normative and analytical criteria:
normative regions are the expression of political will; their limits are fixed according to the tasks allocated to the territorial communities, according to the sizes of population necessary
to carry out these tasks efficiently and economically, and according to historical, cultural and other factors;
analytical (or functional) regions are defined according to analytical requirements; they
group together zones using geographical criteria (e.g. altitude or type of soil) or using socio-economic criteria (e.g. homogeneity, complementarities, or polarity of regional economies).
For practical reasons related to data availability and the implementation of regional policies, the nomenclature is based primarily on the institutional divisions, currently in force in the Member States (normative criteria). In our ontology, the NUTS system is modeled in the following way:
Major, basic and small European regions are all distinct European territories.
All major European regions are identified by their NUTS1 code.
All basic European regions are parts of a major one: the link between any basic region and the major one including it is represented by the role hasNUTS1.
All basic European regions are identified, within the major one they are included in, by their NUTS2 code.
The NUTS1 code of a given major region is also considered a characteristic (NUTS1ref) of the basic regions included in that major region.
All small European regions are parts of a basic one: the link between any small region and the basic one including it is represented by the role hasNUTS2.
All small European regions are identified, within the basic one they are included in, by their NUTS3 code.
The NUTS2 code of a given basic region and the NUTS1 code of the major region including that basic region are also considered characteristics (NUTS2ref and NUTS1ref, respectively) of the small regions included in the basic region.
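The modelling choices listed above can be illustrated with a minimal Python sketch. The class names, the attribute names and the helper small_region_from_code are hypothetical and are not part of the ontology; the helper additionally assumes the Eurostat coding convention in which a NUTS3 code extends its NUTS2 code, which in turn extends its NUTS1 code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MajorRegion:
    nuts1: str  # identifier of a major region


@dataclass(frozen=True)
class BasicRegion:
    nuts2: str              # identifier within the enclosing major region
    has_nuts1: MajorRegion  # role hasNUTS1

    @property
    def nuts1_ref(self) -> str:
        # NUTS1ref: the NUTS1 code inherited from the enclosing major region
        return self.has_nuts1.nuts1


@dataclass(frozen=True)
class SmallRegion:
    nuts3: str              # identifier within the enclosing basic region
    has_nuts2: BasicRegion  # role hasNUTS2

    @property
    def nuts2_ref(self) -> str:
        # NUTS2ref: the NUTS2 code inherited from the enclosing basic region
        return self.has_nuts2.nuts2

    @property
    def nuts1_ref(self) -> str:
        # NUTS1ref: inherited through the basic region
        return self.has_nuts2.nuts1_ref


def small_region_from_code(code: str) -> SmallRegion:
    """Build the region chain from a 5-character NUTS3 code.

    Assumption: codes are nested prefixes (2 letters for the country,
    then one extra character per NUTS level).
    """
    if len(code) != 5:
        raise ValueError(f"expected a 5-character NUTS3 code, got {code!r}")
    major = MajorRegion(nuts1=code[:3])
    basic = BasicRegion(nuts2=code[:4], has_nuts1=major)
    return SmallRegion(nuts3=code, has_nuts2=basic)


region = small_region_from_code("ITI43")
print(region.nuts3, region.nuts2_ref, region.nuts1_ref)
```

In this sketch the two "ref" characteristics are derived from the roles rather than stored twice, so they cannot become inconsistent with hasNUTS1 and hasNUTS2.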
2.5 The Author Affiliation View
As anticipated, we now describe a view over the Sapientia ontology. This view collects those concepts and relations that are relevant for the task of deciding author affiliation as found in atomic publications, and is illustrated in Appendix 3.
As one can easily verify, the Author Affiliation View is constituted by a set of concepts and relations that are present in the modules described above. The new aspect of the view is that such concepts and relations are organized in a simplified way, so as to highlight and gather links
that are immediately useful in our subsequent analysis. In particular, the following observations hold:
The relation has_affiliation appearing in the view is the specialization of the relation with the same name shown in the Agent module to the case where the Agent is an Author.
The relation in_city is a shortcut for the link between an Organization and the City of the Address where its Residence is located, where such link is the chain of the following relations
from Organization to City: has_state_of_organization, has_residence, has_position, has_entrance, is_in_city.
The relation has_publication_affiliation is a shortcut for the chain of the relations has_author and has_affiliation. Following this relation, one is able, for each paper, to reach the Organizations which the authors of the publication are affiliated to.
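The two shortcut relations described above are compositions of more basic relations. The following Python sketch shows the idea on binary relations represented as sets of pairs. The functions compose and chain, as well as all instance names (paper1, org1 and so on), are invented for illustration; only the relation names come from the view.

```python
from functools import reduce


def compose(first, second):
    """Return {(a, c)} such that (a, b) is in `first` and (b, c) is in `second`."""
    successors = {}
    for b, c in second:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in first for c in successors.get(b, ())}


def chain(*relations):
    """Compose a sequence of relations from left to right."""
    return reduce(compose, relations)


# Toy instances (hypothetical).
has_author = {("paper1", "authorA"), ("paper1", "authorB"), ("paper2", "authorB")}
has_affiliation = {("authorA", "org1"), ("authorB", "org2")}

# has_publication_affiliation is the chain of has_author and has_affiliation.
has_publication_affiliation = chain(has_author, has_affiliation)

# in_city is the chain of five relations from Organization to City.
has_state_of_organization = {("org1", "state1")}
has_residence = {("state1", "residence1")}
has_position = {("residence1", "position1")}
has_entrance = {("position1", "entrance1")}
is_in_city = {("entrance1", "Rome")}

in_city = chain(
    has_state_of_organization,
    has_residence,
    has_position,
    has_entrance,
    is_in_city,
)

print(sorted(has_publication_affiliation))
print(sorted(in_city))
```

Representing each relation as a set of pairs keeps the sketch close to the ontology, where a relation is simply the set of its instances.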
The third observation above, concerning the relation has_publication_affiliation, is particularly important for the European Map of Excellence. Indeed, as already observed, the main general goal of the present study is to discuss the problem of establishing a comprehensive list of research organizations, disambiguating the ways in which research organizations show up in bibliographic data, and locating the scientific production of research organizations at geographic level.
In terms of the Author Affiliation View of the ontology this means the following.
Establishing a comprehensive list of research organizations means singling out a mechanism allowing us to compile a sound and complete list of Organizations (in particular Public Research Institutions), together with their relevant properties (in particular the City of residence). We will use the term "Authority File UNI" or "Master list for Organization" for such a list. In other words, this aspect amounts to finding out the correct instances of the concept Organization, where correct here means faithful to the real world we are modelling.
Disambiguating the ways in which research organizations show up in bibliographic data means:
to single out a mechanism allowing us to compile a sound and complete list of Atomic Publications. In other words, this aspect amounts to finding out the correct instances of the concept Atomic_publication, and
to single out a mechanism allowing us to find the Organizations that are to be associated with the various atomic publications, i.e., that correspond to the affiliations of the authors of atomic publications. In other words, this aspect amounts to finding out the correct instances of the relation has_publication_affiliation, the correct instances of the concept Affiliation, and the correct instances of the relation has_involvment_in.
Indeed, it is the chain of has_publication_affiliation and has_involvment_in that allows us to link every atomic publication to the correct organizations.
Locating the scientific production of research organizations at geographic level means to be able to associate with each organization the geographical region where the organization resides. In other words, this aspect amounts to finding out the correct instances of the relation in_city.
Note that there are concepts and relations of the Author Affiliation View whose instances we will not use in the following, namely:
the concept Author
the relation has_affiliation.
Also, notice that we assume that the set of correct instances of the following concepts and relations is available:
the concept Atomic_publication. Indeed, such instances are provided by the bibliographic databases;
the concept Affiliation_in_atomic_publication. Again, such instances are provided by the
bibliographic databases;
the set of City. Such instances can be collected from various reliable sources.
3. TECHNICAL SPECIFICATIONS FOR THE EUROPEAN MAP OF EXCELLENCE (EME)
As we said, the main general goal of the present study is to address three problems: (1)
establishing a comprehensive list of research organizations, in particular, Public Research Organizations - PROs, (2) disambiguating the ways in which research organizations (in particular PROs) show up in bibliographic data, and (3) locating the scientific production of research organizations (in particular PROs) at geographic level.
In section 3.1 we address problem 1 with regard to universities. In other words, we illustrate a way to set up a data set that can be used to establish a high-quality mapping to the concept University of the ontology and its relevant properties. Therefore, section 3.1 reports on activity a) of the Technical and Financial Offer, the one regarding the creation of an Authority File, based on the Nomenclature of Units for Territorial Statistics (NUTS 2), in which all universities in the EU 28 plus Switzerland and Norway have a precise geographic location using GIS coordinates.
Section 3.2 addresses problem 2, by illustrating a number of issues arising in the disambiguation of institution affiliations in scientific publications.
Section 3.3 continues the study of problem 2, and illustrates what has been done in the GRBS (Global Research Benchmarking System) for academic institutions. In this sense, it therefore
reports on activity a) of the Technical and Financial Offer, the one concerning the disambiguation and validation of affiliations of academic institutions used in publications indexed in bibliometric databases. This activity has been carried out on the basis of recent projects on the topic (i.e. ETER and EUMIDA) and other research activities carried out at the Sapienza University of Rome based on one bibliometric database.
Sections 3.3 and 3.4 address problem 2 and problem 3, and report on activity c) of the Technical and Financial Offer. This activity concerns the identification of public research institutions in affiliations used in publications indexed in bibliometric databases and their assignment to Nomenclature of Units for Territorial Statistics (NUTS2 and NUTS3) regions, together with the proposal of an approach for identifying PROs from such affiliations and assigning them to NUTS2 and NUTS3 regions. The approach has been tested on a sample of raw data extracted from a bibliometric database for a large PRO.
3.1. Gathering data about European Universities
The possibility of geo-referencing information pertaining to excellence in S&T of European universities at NUTS2 and NUTS3 level was made possible by the availability of the results of two projects promoted by the European Commission:
EUMIDA - Feasibility Study for Creating a European University Data Collection (Contract No. RTD/C/C4/2009/0233402) completed in 2010;
ETER - European Tertiary Education Register (Contract No. EAC2013038) to be completed in July 2015.
By integrating the information provided by the two projects and filling in a few information gaps, it has been possible to create an Authority File of higher education institutions that deliver the PhD degree in the European Union and in EFTA countries (Iceland, Liechtenstein, Norway, Switzerland).
The file lists 1,131 institutions, which represent roughly 50% of all institutions contained in the European tertiary education register (the remaining ones are institutions not awarding a doctoral, i.e. ISCED 8, degree). For each institution the file reports the ETER code, the name in the national language and in English, and the NUTS codes at levels 2 and 3. Additional geographical information (name of the city, postcode, GIS coordinates) is available in the ETER database.
In principle, the location of universities within NUTS2 regions is quite straightforward, given that multi-site institutions2 with activities spreading across regional borders are not very common. Nevertheless, almost 10% of the HEIs included in the Authority File have secondary branches located in two or more regions (very rarely abroad).
This share increases to 24% if we look at NUTS3 regions. The share of multi-site institutions is actually increasing around Europe as a consequence of two opposite phenomena: on the one hand, the spread of university activities from the original seat into the surrounding region (decentralisation and wider regional coverage); on the other hand, institutional concentration through the merger of small institutions and the creation of larger and more comprehensive HEIs, which usually maintain, at least partially, the original seats and locations in different cities and regions.
At this stage it is not possible to disentangle information for multi-site institutions, and all the activities of a university are located in the region of its main seat. This could create a slight distortion, especially when data are analysed by disciplinary area. It could happen that all functions in a disciplinary area are located in a secondary campus outside the region (e.g. the medical department of Università Cattolica del Sacro Cuore is actually located in Rome: in the map of excellence its figures will be attributed to the ITC4 region instead of ITI43).
The creation of the Authority File is the result of the following analytical steps:
retrieval of the most recent information contained in ETER (the results of the first data collection, referring to year 2011, are publicly available online at eter.joanneum.at/imdas-eter); data about Hungary, Romania and Slovenia are missing;
retrieval of the missing data from the EUMIDA DC1 dataset, publicly available for Hungary and Romania (with reference to year 2008), and from internal resources for Slovenia (reference year 2011, as for ETER);
revision and update of NUTS codes to the latest version in order to allow for full interoperability with the Eurostat regional database.
The Authority File is available on request.
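The three analytical steps listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually used to build the Authority File: the function build_authority_file, the record fields (id, country, nuts2, nuts3), the country codes AA, BB and CC, and all NUTS codes below are placeholders invented for the example.

```python
def build_authority_file(primary, fallbacks, nuts_updates):
    """Merge a primary source with fallback sources and refresh NUTS codes.

    primary:      records from the main source (ETER in the text)
    fallbacks:    further sources, used only for countries not yet covered
    nuts_updates: mapping from outdated NUTS codes to current ones
    """
    covered = {rec["country"] for rec in primary}
    records = list(primary)
    for source in fallbacks:
        # Steps 1 and 2: fill gaps only for countries missing so far.
        missing = [rec for rec in source if rec["country"] not in covered]
        records.extend(missing)
        covered.update(rec["country"] for rec in missing)
    # Step 3: replace outdated NUTS codes, leaving current ones unchanged.
    return [
        {
            **rec,
            "nuts2": nuts_updates.get(rec["nuts2"], rec["nuts2"]),
            "nuts3": nuts_updates.get(rec["nuts3"], rec["nuts3"]),
        }
        for rec in records
    ]


# Placeholder data (hypothetical institutions, countries and codes).
eter = [{"id": "inst-1", "country": "AA", "nuts2": "AA11", "nuts3": "AA111"}]
eumida = [
    {"id": "inst-2", "country": "BB", "nuts2": "BB20", "nuts3": "BB201"},
    {"id": "inst-1-old", "country": "AA", "nuts2": "AA10", "nuts3": "AA101"},
]
internal = [{"id": "inst-3", "country": "CC", "nuts2": "CC01", "nuts3": "CC011"}]
nuts_updates = {"BB20": "BB21", "BB201": "BB211"}

authority_file = build_authority_file(eter, [eumida, internal], nuts_updates)
for record in authority_file:
    print(record)
```

Giving priority to the primary source country by country mirrors the text, where EUMIDA and internal resources are used only for the countries missing from ETER.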
3.2. Problems in the disambiguation of institution affiliations in scientific publications
Academic institutions constitute in most countries by far the most important type of research entities. A second important group of (mainly) publicly funded research institutions is labelled as
public research organizations (PROs). Public Research Organizations can be divided into 4 categories or ideal types (OECD Innovation Policy Platform 2011):
Mission Oriented Centres (MOCs), owned by government departments or ministries at a national level (e.g., INSERM in France; CIEMAT in Spain);
Public Research Centres (PRCs), publicly funded overarching research institutions such as CNRS in France, CNR in Italy, Max Planck Gesellschaft in Germany;
Research Technology Organizations (RTOs), often in the public sphere, private but not-for-profit, such as Fraunhofer Gesellschaft in Germany and TNO in the Netherlands;
Independent Research Institutes (IRIs), often at the boundary between the public and the private sector, denoted as centres of excellence, and recently founded.
Several large disambiguation studies of the names of the academic institutions of authors of millions of scientific publications have been conducted in the past, based on Thomson Reuters' Web of Science (WoS) and Elsevier's Scopus.
The richest branch of the literature on disambiguation in bibliometric studies has dealt with the authors of scientific publications (author name disambiguation): see Ferreira et al. (2012), Strotmann and Zhao (2012) and Milojevic (2013) for a summary of the state of the art on this subject. Another relevant branch of bibliometrics has investigated research units and institutions as the main unit of analysis (affiliation disambiguation). For the purpose of this study we focus on the issues related to the disambiguation of affiliations in bibliometric databases.
2 According to the definition developed in ETER, multi-site institutions are defined as institutions with local establishments in NUTS3 region(s) that are different from the main seat, even if the definition leaves a margin of flexibility to national statistical offices to adapt to national contexts.
Although the problems of affiliation disambiguation have been present in the literature for several years (see the reconstruction and the analysis of technical and theoretical issues below), only a few works have addressed these problems and proposed workable solutions. See Box 1 for an overview of the recent literature on this issue.
The disambiguation of institutional affiliations is part of the area of entity resolution, which addresses the general problem of identifying and linking/grouping different manifestations of the same real-world object (Schaerf 2015; for an introduction see Talburt 2011).
This activity has encountered a series of problems within the bibliometric application framework. A distinction can be made between technical and theoretical issues. This section provides an overview of the main issues, and sketches the main lines of how these can be solved3. Although
almost all studies conducted in the past related to academic institutions, the issues that were encountered relate to the affiliation disambiguation of PROs as well. Additional comments are made that specifically relate to PROs.
3 This section is based on Moed (2015).
Box 1: Overview of the recent literature on affiliation disambiguation

Reference: Jiang et al., 2011
Context: conversion of publication affiliations into semantic web data.
Issues faced: affiliation ambiguity (different authors who have the same affiliation often express it in different ways).
Solution proposed: a clustering method based on normalized compression distance.

Reference: Morillo et al., 2013
Context: standardisation of affiliations and addresses for the assessment of research production.
Issues faced: standardizing or codifying addresses in order to produce bibliometric indicators from bibliographic databases.
Solution proposed: a semi-automatic method which is supposed to work without pre-existing master lists or tables. The analysis is limited to the sectorial level (tested on a sample of 136,821 documents from the WoS database); further research is envisaged to identify individual organisations.

Reference: Cuxac et al., 2013
Context: disambiguation of the affiliations of authors of scientific papers in bibliographic databases at organisation level (laboratory, institute, university, research center). Application on a sample of French CNRS affiliations.
Issues faced: high variability and heterogeneity of naming in large bibliometric databases.
Solution proposed: two approaches. The first assumes that a training dataset is available and uses a Naive Bayes model; the second assumes that there is no learning resource and uses a semi-supervised approach, mixing soft clustering and Bayesian learning.

Reference: S. Huang et al., 2014
Context: improvement of existing techniques of institution name disambiguation based on word similarity or editing distance.
Issues faced: high variability and heterogeneity of naming in large bibliometric databases.
Solution proposed: the paper proposes a rule-based algorithm. One-to-many relationships between an institution and the many variant names under which it is referred to in bylines of publications are recognized with the aid of statistical methods and specific rules. The performance of the rule-based institution name disambiguation algorithm is evaluated on large datasets in four fields. A test on metadata provided by the WoS shows that basic structures, e.g. university, department, street address, city, including ZIP code, can often be recognized. The study proposes an algorithm for institution name mapping. No specific attention is devoted to the explosion of the organisational level of institutions.
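The normalized compression distance mentioned in the Jiang et al. (2011) entry of Box 1 is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C denotes the size of the compressed input. The Python sketch below computes it with the zlib compressor from the standard library. It is an illustration of the distance only, not a reproduction of the clustering method of Jiang et al.; the choice of zlib and the example strings are assumptions.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    Values near 0 suggest that the strings share most of their content;
    values near 1 suggest that they have little in common.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Example affiliation strings (illustrative only).
a = "Sapienza University of Rome, Rome, Italy"
b = "Sapienza Univ Rome, Rome, Italy"
c = "Fraunhofer Gesellschaft, Munich, Germany"

for left, right in [(a, a), (a, b), (a, c)]:
    print(f"{ncd(left, right):.2f}  {left!r} vs {right!r}")
```

On strings as short as affiliation bylines the fixed overhead of the compressor is not negligible, so the values should be read as a relative ranking rather than as calibrated distances.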
Main technical and theoretical issues
Technical issues
Researchers do not indicate their institutional affiliations in a uniform manner. Even authors from the same institution or department may use different names. Moreover, naming conventions may change over time. Some institutions impose naming conventions upon their researchers in order to increase the institution's visibility, including in university rankings. For these institutions the uniformity of institutional names can be expected to be much greater than for those in which such a policy is lacking4.
4 Another source of variability in affiliation data is the fact that publishers do not use uniform formats for the way in which institutional affiliations are recorded in a publication. For instance, some publishers in
Bibliographic database producers apply data capturing rules that re-format and modify the original affiliation data in the scientific articles they index: what is included in the database is not always identical to what was indicated in the original article. Producers of WoS and Scopus to some extent disambiguate institutional names; they also insert a type of hierarchical order in the various components of an institutional name string, identifying the so-called main organization, but this process is far from perfect and may contain errors (see, e.g., Moed 2005). Precise estimates of their occurrence are unavailable.
Additional information sources are needed to comprehensively identify research institutions or organizations in a particular country. Normally research entities appear in many variations. Hence, when analysing the research output of a larger country, long lists of institutional names must be checked by analysts who must know what they are looking for in the data.
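The technical issues above suggest a two-stage treatment of raw affiliation strings: purely superficial variation (case, accents, punctuation) can be removed automatically, while the remaining variants must be checked by analysts against a master list. The following Python sketch illustrates the first stage only. The functions normalize, build_index and resolve, the identifier org-1 and the small master list are hypothetical.

```python
import re
import unicodedata


def normalize(name: str) -> str:
    """Lower-case, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(re.findall(r"[a-z0-9]+", without_accents.lower()))


def build_index(master_list):
    """Map every normalized known variant to its organization identifier."""
    return {
        normalize(variant): org_id
        for org_id, variants in master_list.items()
        for variant in variants
    }


def resolve(raw_affiliation, index):
    """Return the organization identifier, or None if analyst review is needed."""
    return index.get(normalize(raw_affiliation))


# Hypothetical master list with known name variants.
master_list = {
    "org-1": [
        "Sapienza University of Rome",
        "Università degli Studi di Roma La Sapienza",
    ],
}
index = build_index(master_list)

print(resolve('UNIVERSITÀ DEGLI STUDI DI ROMA  "La Sapienza"', index))
print(resolve("Univ. Rome Sapienza", index))
```

Exact matching after normalization deliberately leaves abbreviations and reordered names unresolved, so that these cases reach the analysts instead of being matched wrongly.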
Theoretical issues
It is not always clear how research institutions should be defined. Should affiliated departments be considered a part of an organization? For instance, are academic or affiliated hospitals in all cases parts of universities? Should the components of a university system or umbrella organization such as the University of Texas be aggregated or analysed separately? Background knowledge on the structure of individual institutions or organizations and of national research systems is indispensable. This means that a clear mapping should be
designed between the relevant data sources and the concept O