Maurizio Lenzerini July 2015

EUR 27388 EN

Corroboration of the algorithm and the technical specifications for the European Map of Excellence and Specialization (EMES)

EUROPEAN COMMISSION

Directorate-General for Research and Innovation
Directorate A: Policy Development and Coordination
Unit A6: Science Policy, Foresight and Data

Contact: Emanuele Barbarossa, Katarzyna Bitka

E-mail: [email protected]


European Commission B-1049 Brussels


This document is based on projects carried out by the ONTORES research group at Sapienza University of Rome. The contributions of Alessandro Bartolucci, Cinzia Daraio, Camil Demetrescu, Emanuele Fusco, Claudio Leporelli, Henk F. Moed, and Paolo Naggar are warmly acknowledged.

Directorate-General for Research and Innovation

2015 Research, Innovation, and Science Policy Experts High Level Group EUR 27388 EN

LEGAL NOTICE

This document has been prepared for the European Commission; however, it reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.

More information on the European Union is available on the internet (http://europa.eu).

Luxembourg: Publications Office of the European Union, 2015.

ISBN 978-92-79-50353-5

doi 10.2777/566251

ISSN 1831-9424

© European Union, 2015. Reproduction is authorised provided the source is acknowledged.

EUROPE DIRECT is a service to help you find answers to your questions about the European Union

Freephone number (*): 00 800 6 7 8 9 10 11

(*) The information given is free, as are most calls (though some operators, phone boxes or hotels may charge you)

Table of contents

ABSTRACT
EXECUTIVE SUMMARY
RÉSUMÉ
INTRODUCTION
1. AN OBDM APPROACH TO DESIGN AN EMES SUSTAINABLE OVER TIME
    1.1 Difficulties in accessing and managing distributed and heterogeneous data
    1.2 What is OBDM
2. FUNDAMENTAL MODULES OF AN ONTOLOGY FOR EMES
    2.1 The module Agent
        Subject
        Agent
        Natural person
        Organization
    2.2 The module Research
        Research outputs and outcomes
        Technological transfers
    2.3 The module Publishing
        Publication
        Content
        References
    2.4 The module Space
        Space
        Place
        Territory
    2.5 The Author Affiliation View
3. TECHNICAL SPECIFICATIONS FOR THE EUROPEAN MAP OF EXCELLENCE (EME)
    3.1 Gathering data about European Universities
    3.2 Problems in the disambiguation of institution affiliations in scientific publications
        Main technical and theoretical issues
        Key elements of the proposed affiliation disambiguation approach
    3.3 What has been done in the GRBS (Global Research Benchmarking System)
    3.4 Identification of PROs: Towards an authority file for PROs
    3.5 Disambiguation of affiliations of European research institutes. Description of a successful approach and the main steps of its algorithm
    3.6 An examination of the conditions under which the bibliometric databases underlying the European Map of Excellence (EME) are available for use in the EME project: problems and options
4. ASSESSMENT
5. MAIN RECOMMENDATIONS
REFERENCES
FURTHER DOCUMENTS
APPENDICES
    Appendix 1: Technical description of the algorithm for disambiguation of affiliations of European research institutes
    Appendix 2: The Author Affiliation View of the ontology

ABSTRACT

We present a study whose goal is to deal with three problems: (1) establishing a comprehensive list of research organizations, in particular Public Research Organizations; (2) disambiguating the ways in which research organizations show up in bibliographic data; and (3) locating the scientific production of research organizations at the geographic level.

We address these issues by resorting to a so-called Ontology-Based Data Management approach, in which a conceptual representation of the relevant concepts and relationships, called an ontology, is used for modelling the data of the underlying information system. Using this approach, we illustrate some principles for designing a European Map of Excellence and Specialisation that is sustainable over time.


EXECUTIVE SUMMARY

The European scientific system is formed by two large segments: universities and Public Research Organisations (PROs). Together, these two segments account for the bulk of scientific publications. The scientific output of universities is the object of a large effort in the dedicated literature, as well as in various types of university rankings. The European Commission itself has supported the launch of the U-Multirank exercise. In addition, locating university activity geographically is made easier by the fact that universities are typically mono-plant (single-site) organizations.

By contrast, little is known, on a systematic and comparable basis, about the scientific output of Public Research Organisations. More specifically, the following issues are still open: (a) establishing a comprehensive list of PROs; (b) disambiguating the innumerable ways in which PROs show up in bibliometric information, and normalizing affiliations; (c) locating the scientific production of PROs at the geographic level.

With respect to the list, it would be important to search for all European institutions beyond a given threshold of visibility in order to build up a repository.

Disambiguation is a very complex undertaking, since there are many ways in which the names of PROs appear. For example, in many cases only the name of the institute is visible, while the name of the umbrella PRO is not shown, is shown as an abbreviation, or is embedded in the overall name of the institute.

Finally, georeferencing the publications of PROs is made difficult by the fact that the location of the laboratory or institute does not necessarily appear in the affiliations of the articles in bibliometric databases.

It is understood that all three elements will require a dedicated large scale effort in the future.

The current study has been carried out as a proof of concept, needed in order to verify feasibility and to provide a basis for estimating the time and cost involved in a full-scale exercise.

The study has critically examined the technical feasibility and validity of existing methods and approaches for addressing the open issues recalled above.

The assessment of the feasibility has been based on the following criteria:

Availability of data on publications. It is very important to define clearly the availability of data on publications in terms of sources, commercial conditions and regular updating.

Degree of automation of the procedures. Applying the algorithm described in this study in a full-scale exercise at European level does not pose computation-time problems. As shown in the related sections, some work is required to improve the master list, with particular attention to its maintenance and updating over time. This calls for a data infrastructure able to improve and maintain the master list over time. The OBDM approach illustrated in this feasibility study is a very promising tool, and we recommend adopting it.

Operational conditions needed. As discussed at length in the study, it is important to involve universities and PRO branches (institutions) in checking the list of variants of institution names found in publications. Their involvement is also essential for checking and updating this list, which requires a system of incentives for the research institutions involved, to ensure their collaboration over time.

A summary of our recommendations follows.

Adopt an Ontology-Based Data Management (OBDM) approach to ensure replicability, extensions and updating of the system over time.

The availability of data on publications in terms of sources, commercial conditions and regular updating should be clearly discussed and settled. It is therefore of the utmost importance that the European Commission, informed by experts in the field, consider opening negotiations at an early stage with producers of large bibliographic databases about the conditions of use of these databases as sources in the creation and maintenance of the public information system at stake, and reach with one or more producers an appropriate license agreement covering a sufficiently long time period.

The disambiguation is feasible, as illustrated in the study presented here. The main guidelines for its successful implementation can be summarized as follows.

General principles:

Use a validated thesaurus or authority file;

Use advanced disambiguation software;

Start from the raw affiliation data; do not use ill-understood features implemented by indexers;

Consult national and institutional experts; create an annotation file with background information on how an institution is defined;

Initiate further research into this issue, for example by examining the use of information from funding acknowledgements.
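As an illustration of the first and third principles, the following minimal Python sketch matches raw affiliation strings against an authority file of name variants. It is not the algorithm of this study (which is described in Section 3.5 and Appendix 1); the identifiers `PRO-0001` and `HEI-0001`, the variant lists, and all function names are hypothetical and chosen only for the example.

```python
import re
import unicodedata

# Hypothetical authority file: institution ID -> known name variants.
# Both the IDs and the variant lists are illustrative assumptions.
AUTHORITY_FILE = {
    "PRO-0001": ["Consiglio Nazionale delle Ricerche", "CNR"],
    "HEI-0001": ["Sapienza University of Rome", "Università di Roma La Sapienza"],
}


def normalize(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())


def build_index(authority_file):
    """Map every normalized variant to its institution ID."""
    index = {}
    for inst_id, variants in authority_file.items():
        for variant in variants:
            index[normalize(variant)] = inst_id
    return index


def match_affiliation(raw, index):
    """Return the IDs whose variant occurs as a whole-word phrase in the raw string."""
    padded = f" {normalize(raw)} "
    return sorted({inst_id for variant, inst_id in index.items()
                   if f" {variant} " in padded})


INDEX = build_index(AUTHORITY_FILE)
print(match_affiliation('Dip. Informatica, Università di Roma "La Sapienza", Italy', INDEX))
print(match_affiliation("Ist. Sci. & Tecnol. Informaz., CNR, Pisa", INDEX))
```

Exact phrase matching of this kind only covers variants already listed in the authority file, which is one reason the principles above also call for expert validation and further research.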

Technical features to consider:

For Higher Education Institutions, adopt the ID proposed by the ETER project, funded by DG Education and Culture in collaboration with DG Research and Eurostat.

A system of ID for the PROs and their hierarchical organization should be developed.

Compliance with existing standards related to the research system should be considered, by referring, for instance, to the following:

i. ORCID (http://orcid.org/) is a non-profit organization, supported by research organizations, agencies, providers of publication management systems, and publishers, aiming at giving all researchers a unique identifier (ORCID_id number) and keeping it persistent over time. Established at the end of 2009, but operational since the end of 2012, it has almost reached one million researchers worldwide. Most of the increase has been achieved in a very short time frame: from 100,000 in March 2013 to almost 970,000 as of October 2014 (with 35% from European, Middle East and Asian countries);

ii. CERIF is a Europe-based initiative aiming at standardizing the operations of funding agencies (http://www.eurocris.org);

iii. CASRAI (www.casrai.org) is a Canada-US initiative for the standardization of data on research institutions and funders (also supported by a committee of Science Europe; http://www.scienceeurope.org/scientific-committees/Life-sciences/life-sciences-committee);

iv. ISNI (www.isni.org) provides lists and metadata on higher education, research, funding and many other types of organizations, while Ringgold (www.ringgold.com) does the same in the world of publishers and intermediaries.

The Master list should be maintained at the official level by the Commission.
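When identifiers such as the ORCID iD mentioned above are stored in the master list, a basic integrity check can catch transcription errors. The sketch below is not part of the study; it implements the ISO 7064 MOD 11-2 check character that ORCID documents for the 16-character iD, and the function names are our own.

```python
import re


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Check the format (16 characters, hyphens optional) and the check character."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_character(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False
```

A check of this kind only validates the form of an identifier; whether the iD belongs to the intended researcher still has to be verified against the ORCID registry.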


RÉSUMÉ

The European scientific system is formed by two large segments: universities and Public Research Organisations (PROs). Together, these two segments account for the largest part of scientific publications. The scientific output of universities is the object of a large effort in the specialised literature, as well as in various types of university rankings. The European Commission itself has supported the launch of the U-Multirank exercise. In addition, locating university activity geographically is made easier by the fact that universities are generally "mono-plant" organizations, being located in a single city in about 95% of cases.

By contrast, little is known, on a systematic and comparable basis, about the scientific output of Public Research Organisations. More specifically, the following issues are still open:

establishing a comprehensive list of PROs;

disambiguating the innumerable ways in which PROs show up in bibliometric information, and normalizing affiliations;

locating the scientific production of PROs at the geographic level.

With respect to the list, it would be important to build a repository. Disambiguation is also complex, since there are many ways in which the names of PROs appear. Finally, georeferencing the publications of PROs is difficult.

Resolving these three issues will therefore require a dedicated effort.

The current study has been carried out as a proof of concept, and has critically examined the technical feasibility and validity of existing methods and approaches for addressing the open issues recalled above.

The assessment of feasibility has been based on the following criteria:

Availability of data on publications. It is very important to define clearly the availability of data on publications in terms of sources, commercial conditions and regular updating.

Degree of automation of the procedures. Applying the algorithm described in this study in a large-scale exercise at European level does not pose computation-time problems.

Operational conditions needed. As discussed at length in the study, it is important to involve universities and PRO branches (institutions) in checking the list of variants of institution names found in publications.

A summary of our recommendations follows.

Adopt an OBDM approach to ensure replicability, extensions and updating of the system over time.

The availability of data on publications in terms of sources, commercial conditions and regular updating should be clearly discussed and settled. It is therefore of the utmost importance that the European Commission, informed by experts in the field, consider opening negotiations at an early stage with producers of large bibliographic databases about the conditions of use of these databases as sources in the creation and maintenance of the public information system at stake, and reach with one or more producers an appropriate license agreement covering a sufficiently long time period.

The disambiguation is feasible, as illustrated in the study presented here. The main guidelines for its successful implementation can be summarized as follows.

General principles:

Use a validated thesaurus or authority file;

Use advanced disambiguation software;

Start from the raw affiliation data; do not use ill-understood features implemented by indexers;

Consult national and institutional experts; create an annotation file with background information on how an institution is defined;

Initiate further research into this issue, for example by examining the use of information from funding acknowledgements.

Technical features to consider:

For Higher Education Institutions, adopt the ID proposed by the ETER project, funded by DG Education and Culture in collaboration with DG Research and Eurostat.

A system of ID for the PROs and their hierarchical organization should be developed.

Compliance with existing standards related to the research system should be considered, by referring, for instance, to the following:

i. ORCID (http://orcid.org/) is a non-profit organization, supported by research organizations, agencies, providers of publication management systems, and publishers, aiming at giving all researchers a unique identifier (ORCID_id number) and keeping it persistent over time. Established at the end of 2009, but operational since the end of 2012, it has almost reached one million researchers worldwide. Most of the increase has been achieved in a very short time frame: from 100,000 in March 2013 to almost 970,000 as of October 2014 (with 35% from European, Middle East and Asian countries);

ii. CERIF is a Europe-based initiative aiming at standardizing the operations of funding agencies (http://www.eurocris.org);

iii. CASRAI (www.casrai.org) is a Canada-US initiative for the standardization of data on research institutions and funders (also supported by a committee of Science Europe; http://www.scienceeurope.org/scientific-committees/Life-sciences/life-sciences-committee);

iv. ISNI (www.isni.org) provides lists and metadata on higher education, research, funding and many other types of organizations, while Ringgold (www.ringgold.com) does the same in the world of publishers and intermediaries.

The Master list should be maintained at the official level by the Commission.


INTRODUCTION

This document reports on a study whose main goal is to address three problems: (1) establishing a comprehensive list of research organizations, in particular Public Research Organizations (PROs); (2) disambiguating the ways in which research organizations (in particular PROs) show up in bibliographic data; and (3) locating the scientific production of research organizations (in particular PROs) at the geographic level.

This document is structured as follows. In Section 1 we present an approach, called Ontology-Based Data Management (OBDM), to designing a European Map of Excellence and Specialisation (EMES) that is sustainable over time. In Section 2 we briefly describe the most relevant modules of an ontology for EMES that we are currently developing at Sapienza University of Rome. In Section 3 we address the technical problems related to EMES. In Section 4 we present an assessment of the feasibility of the activities described in this document, and in Section 5 we report our main recommendations for the design of a sustainable information system for EMES.

In the document, we will explicitly refer to the Technical Specification of the Tender, namely:

Creation of an Authority File, referenced to the Nomenclature of Territorial Units for Statistics (NUTS 2), in which all universities in the EU 28 plus Switzerland and Norway have a precise geographic location using GIS coordinates. This authority file will be provided on the basis of recent projects carried out on the topic (i.e. ETER and EUMIDA) and other research activities carried out at the Sapienza University of Rome.

Disambiguation and validation of affiliations of academic institutions used in publications indexed in bibliometric databases. This activity will be provided on the basis of recent projects carried out on the topic (i.e. ETER and EUMIDA) and other research activities carried out at the Sapienza University of Rome based on one bibliometric database.

Identification of public research institutions (PRIs) in affiliations used in publications indexed in bibliometric databases, and assignment of these to NUTS 2 and NUTS 3 regions. Proposal of an approach for identifying PRIs from affiliations of publications indexed in a bibliometric database and assigning NUTS 2 and NUTS 3 regions to them. The approach will be tested on a sample of raw data extracted from a bibliometric database for a large PRO. A detailed description of the approach will be provided so that it can be replicated at full scale.

An examination of the conditions under which the bibliometric databases underlying the EME are available for use in the EME project.
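To make the first and third items more concrete, the following Python sketch shows one possible shape for an authority-file record that carries GIS coordinates and a NUTS 3 code. It relies on the fact that NUTS codes are hierarchical (two-letter country code followed by one character per level), so the NUTS 2 region is the first four characters of the NUTS 3 code. The record layout, the identifiers, the institution names and the coordinates are illustrative assumptions, not data from this study, and the NUTS codes should be checked against the classification version in force.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorityRecord:
    institution_id: str  # hypothetical identifier, e.g. an ETER-style ID
    name: str
    latitude: float      # GIS coordinates in decimal degrees (illustrative)
    longitude: float
    nuts3: str           # five-character NUTS 3 code

    @property
    def nuts2(self):
        """NUTS 2 region: country code plus two level characters."""
        return self.nuts3[:4]

    @property
    def country(self):
        """Two-letter country prefix of the NUTS code."""
        return self.nuts3[:2]


def count_by_nuts2(records):
    """Count institutions per NUTS 2 region."""
    counts = {}
    for record in records:
        counts[record.nuts2] = counts.get(record.nuts2, 0) + 1
    return counts


RECORDS = [
    AuthorityRecord("EX-001", "Example University A", 41.90, 12.51, "ITI43"),
    AuthorityRecord("EX-002", "Example Institute B", 42.42, 12.10, "ITI41"),
    AuthorityRecord("EX-003", "Example University C", 48.85, 2.35, "FR101"),
]
print(count_by_nuts2(RECORDS))
```

Storing only the most detailed code and deriving the coarser levels keeps the record free of redundant, potentially inconsistent region fields.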


1. AN OBDM APPROACH TO DESIGN AN EMES SUSTAINABLE OVER TIME

The quantitative analysis of Science and Technology is becoming a big data science, with an increasing level of computerization, in which large and heterogeneous datasets on various aspects are combined. In this context, understanding and formally specifying the meaning of data is of paramount importance.¹

Within this framework, optimistic views, supporting the end of theory in favour of data-driven science (Kitchin 2014), have been opposed to more critical positions in favour of theory-driven scientific discoveries (Frické 2014). It has been rightly highlighted that "Data are not simply addenda or second-order artifacts; rather, they are the heart of much of the narrative literature, the protean stuff that allows for inference, interpretation, theory building, innovation, and invention" (Cronin, 2013, p. 435). Moreover, the need for accountability of Science, Technology and Innovation (STI) activities, in order to sustain their funding in the current difficult economic and financial situation, increasingly calls for rigorous empirical evidence to support informed policy making.

The need to move beyond the logic of rankings, together with new trends in indicator development such as granularity and cross-referencing, can be explored and exploited in open data platforms with a clear description of the main concepts of the domain (Daraio & Bonaccorsi 2014). The multidimensionality of research assessment and scholarly impact (Moed & Halevi 2015), and the recent altmetrics movement (Cronin & Sugimoto 2014), are questioning the traditional approach to indicator development.

Research assessment, indeed, is becoming increasingly complex due to its multidimensional nature. A report published in 2010 by the Expert Group on the Assessment of University-Based Research, set up by the European Commission, proposed a consolidated multidimensional methodological approach addressing the various user needs, interests and purposes, and identifying data and indicator requirements (AUBR 2010, p. 10). A key notion holds that indicators designed to meet a particular objective or inform one target group may not be adequate for other purposes or target groups. Diverse institutional missions, and different policy environments and objectives, require different assessment processes and indicators. In addition, the range of people and organizations requiring information about university-based research is growing. Each group has specific but also overlapping requirements (AUBR 2010, p. 51).

¹ This section is taken from Lenzerini (2015) and Daraio, Lenzerini et al. (2015), to which the reader is referred for the references therein.

Printed outputs (texts) | Non-printed outputs (non-text) | Main type of impact
-----------------------------------------------------------------------------
Scientific journal paper; book chapter; scholarly monograph | Research data file; video of experiment; software | Scientific-scholarly
Patent; commissioned research report | New product or process; material; device; design; image; spin-off | Economic or technological
Professional guidelines; newspaper article; communication submitted to social media, including blogs, tweets | Interview; event; art performance; exhibit; artwork; scientific-scholarly advice | Social or cultural

Table 1: Main types of research outputs (source: Daraio, Lenzerini et al., 2015)

A research assessment has to take into account a range of different types of research output and impact. As regards output forms, one important distinction is between text-based and non-text-based output forms. The main types are presented in Table 1. This table is not fully comprehensive. The specifications of the Panel Criteria in the Research Excellence Framework in the UK (REF 2012, p. 51 ff.) provide more detailed lists of possible output forms arranged by major research discipline. Table 1 includes forms that are becoming increasingly important, such as research data files and communications submitted to social media and scholarly blogs.

A framework for the assessment of these forms is being developed in the field of altmetrics (e.g., Taylor 2013). The last column of Table 1 indicates the main types of impact a particular output may have. A distinction is made between scientific-scholarly impact and wider impact outside the domain of science and scholarship, denoted as societal, a concept that embraces technological, economic, social and cultural impact. A comprehensive overview of the types of impact and of the most frequently used impact indicators is presented in Table 2. The reader is referred to AUBR (2010) and Moed & Halevi (2015) for a further discussion of this table.

Type of impact | Short description; typical examples | Indicators (examples)
-----------------------------------------------------------------------------
Scientific-scholarly or academic
Knowledge growth | Contribution to scientific-scholarly progress: creation of new scientific knowledge | Indicators based on publications and citations in peer-reviewed journals and books
Research networks | Integration in (inter)national scientific-scholarly networks and research teams | (Inter)national collaborations including co-authorships; participation in emerging topics
Publication outlets | Effectiveness of publication strategies; visibility and quality of used publication outlets | Journal impact factors and other journal metrics; diversity of used outlets
Societal
Social | Stimulating new approaches to social issues; informing public debate and improving policymaking; informing practitioners and improving professional practices; providing external users with useful knowledge; improving people's health and quality of life; improvements in environment and lifestyle | Citations in medical guidelines or policy documents to research articles; funding received from end-users; end-user esteem (e.g., appointments in (inter)national organizations, advisory committees); juried selection of artworks for exhibitions; mentions of research work in social media
Technological | Creation of new technologies (products and services) or enhancement of existing ones based on scientific research | Citations in patents to the scientific literature (journal articles)
Economic | Improved productivity; adding to economic growth and wealth creation; enhancing the skills base; increased innovation capability and global competitiveness; uptake of recycling techniques | Revenues created from the commercialization of research-generated intellectual property (IP); number of patents, licenses, spin-offs; number of PhD and equivalent research doctorates; employability of PhD graduates
Cultural | Supporting greater understanding of where we have come from, and who and what we are; bringing new ideas and new modes of experience to the nation | Media (e.g. TV) performances; essays on scientific achievements in newspapers and weeklies; mentions of research work in social media

Table 2: Types of Research Impact and Indicators (source: Daraio, Lenzerini et al. 2015)

It is also important to include the inputs in the analysis; they should be jointly analysed with the outputs to assess the overall impact of the process (see e.g. Daraio et al. 2014, for a conditional multidimensional approach to rank higher education institutions).

To meet all these new trends and policy needs, a shift in the paradigm of data integration for research assessment is needed. In this paper we advocate an OBDM approach to research assessment.

1.1 Difficulties in accessing and managing distributed and heterogeneous data

While the amount of data stored in current information systems and the processes making use of such data continuously grow, turning these data into information, and governing both data and processes, are still tremendously challenging tasks for Information Technology. The problem is complicated by the proliferation of data sources and services both within a single organization and in cooperating environments. The following factors explain why such a proliferation constitutes a major problem with respect to the goal of carrying out effective data governance tasks:

Although the initial design of a collection of data sources and services might be adequate, corrective maintenance actions tend to re-shape them into a form that often diverges from the original conceptual structure.

It is common practice to change a data source (e.g., a database) so as to adapt it both to specific application-dependent needs and to new requirements. The result is that data sources often become data structures coupled to a specific application (or a class of applications), rather than application-independent databases.

The data stored in different sources and the processes operating over them tend to be redundant and mutually inconsistent, mainly because of the lack of central, coherent and unified coordination of data management tasks.

The result is that information systems of medium and large organizations are typically structured according to a silo-based architecture, constituted by several independent and distributed data sources, each one serving a specific application. This poses great difficulties with respect to the goal of accessing data in a unified and coherent way. Analogously, processes relevant to the organizations are often hidden in software applications, and a formal, up-to-date description of what they do on the data and how they are related to other processes is often missing. The introduction of service-oriented architectures is not a solution to this problem per se, because the fact that data and processes are packed into services is not sufficient for making the meaning of data and processes explicit. Indeed, services become other artifacts to document and maintain, adding complexity to the governance problem. Analogously, data warehousing techniques, and the separation they advocate between the management of data for the operation level and data for the decision level, do not provide solutions to this challenge. On the contrary, they also add complexity to the system, by replicating data in different layers of the system and introducing synchronization processes across layers.

All the above observations show that unified access to data and effective governance of processes and services are extremely difficult goals to achieve in modern information systems. Yet, both are crucial objectives for getting useful information out of the information system, as well as for taking decisions based on it.

This explains why organizations spend a great deal of time and money for the understanding, the governance, the management, and the integration of data stored in different sources, and of the processes/services that operate on them, and why this problem is often cited as a key and costly Information Technology challenge faced by medium and large organizations today (Bernstein & Haas, 2008). We argue that ontology-based data management (OBDM, Lenzerini 2011) is a promising direction for addressing the above challenges.


1.2 What is OBDM

The key idea of OBDM is to resort to a three-level architecture, constituted by the ontology, the sources, and the mapping between the two. The ontology is a conceptual, formal description of the domain of interest to a given organization (or, a community of users), expressed in terms of relevant concepts, attributes of concepts, relationships between concepts, and logical assertions characterizing the domain knowledge. The data sources are the repositories accessible by the organization where data concerning the domain are stored. In the general case, such repositories are numerous and heterogeneous, each one managed and maintained independently from the others. The mapping is a precise specification of the correspondence between the data contained in the data sources and the elements of the ontology.
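The three levels can be illustrated with a deliberately small sketch. Everything in it is hypothetical and chosen only for illustration: the two toy sources, the predicate names Researcher and authorOf, and the helper functions are not part of any actual OBDM system.

```python
# Toy illustration of the three OBDM levels; all names and data are hypothetical.

# Level 1: the sources, two independent and heterogeneous repositories.
hr_db = [{"emp_id": 1, "name": "Ada", "role": "prof"},
         {"emp_id": 2, "name": "Bob", "role": "admin"}]
library_csv = [("Ada", "Paper-42")]

# Level 2: the ontology, reduced here to a vocabulary of concepts and relationships.
ontology = {"concepts": {"Researcher"}, "relationships": {"authorOf"}}

# Level 3: the mapping, relating source data to elements of the ontology.
def apply_mapping():
    facts = set()
    for row in hr_db:
        if row["role"] == "prof":
            facts.add(("Researcher", row["name"]))
    for author, paper in library_csv:
        facts.add(("authorOf", author, paper))
    return facts

def query(predicate):
    """Answer a query phrased with an ontology predicate, not a source schema."""
    if predicate not in ontology["concepts"] | ontology["relationships"]:
        raise ValueError(f"unknown ontology element: {predicate}")
    return sorted(fact[1:] for fact in apply_mapping() if fact[0] == predicate)
```

A consumer would call, for instance, query("Researcher") without knowing that the answer comes from the hypothetical hr_db table.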

The main purpose of an OBDM system is to allow information consumers to query the data using the elements in the ontology as predicates. In this sense, OBDM can be seen as a form of information integration, where the usual global schema is replaced by the conceptual model of the application domain, formulated as an ontology expressed in a logic-based language. With this approach, the integrated view that the system provides to information consumers is not merely a data structure accommodating the various data at the sources, but a semantically rich description of the relevant concepts in the domain of interest, as well as the relationships between such concepts. The distinction between the ontology and the data sources reflects the separation between the conceptual level, the one presented to the client, and the logical/physical level of the information system, the one stored in the sources, with the mapping acting as the reconciling structure between the two levels. This separation brings several potential advantages:

The ontology layer in the architecture is the obvious means for pursuing a declarative approach to information integration, and, more generally, to data governance. By making the representation of the domain explicit, we gain re-usability of the acquired knowledge, which is not achieved when the global schema is simply a unified description of the underlying data sources.

The mapping layer explicitly specifies the relationships between the domain concepts on the one hand and the data sources on the other hand. Such a mapping is not only used for the operation of the information system, but also for documentation purposes. The importance of this aspect clearly emerges when looking at large organisations where the information about data is scattered across separate pieces of documentation that are often difficult to access and rarely conform to common standards. The ontology and the corresponding mappings to the data sources provide a common ground for the documentation of all the data in the organisation, with obvious advantages for the governance and the management of the information system.

A third advantage has to do with the extensibility of the system. One criticism that is often raised against data integration is that it requires merging and integrating the source data in advance, and this merging process can be very costly. However, the ontology-based approach we advocate does not require fully integrating the data sources at once. Rather, after building even a rough skeleton of the domain model, one can incrementally add new data sources or new elements therein, when they become available, or when needed, thus amortising the cost of integration. Therefore, the overall design can be regarded as the incremental process of understanding and representing the domain, the available data sources, and the relationships between them. The goal is to support the evolution of both the ontology and the mappings in such a way that the system continues to operate while evolving, along the lines of "pay-as-you-go" data integration.

The notions of OBDM were introduced in (Calvanese et al. 2007; Poggi et al. 2008; Lenzerini 2011), and originated from several disciplines, in particular, Information Integration, Knowledge Representation and Reasoning, and Incomplete and Deductive Databases. The central notion of OBDM is therefore the ontology, and reasoning over the ontology is at the basis of all the tasks that an OBDM system has to carry out. In particular, the axioms of the ontology allow one to derive new facts from the source data, and these inferred facts greatly influence the set of answers that the system should compute during query processing.

In the last decades, research on ontology languages and ontology inferencing has been very active in the area of Knowledge Representation and Reasoning. Description Logics (DLs, Baader et al. 2007) are widely recognized as appropriate logics for expressing ontologies, and are at the basis of the W3C standard ontology language OWL. These logics permit the specification of a domain by providing the definition of classes and by structuring the knowledge about the classes using a rich set of logical operators. They are decidable fragments of mathematical logic, resulting from extensive investigations on the trade-off between expressive power of Knowledge Representation languages, and computational complexity of reasoning tasks. Indeed, the constructs appearing in the DLs used in OBDI are carefully chosen taking into account such a trade-off (Calvanese et al. 2007). As indicated above, the axioms in the ontology can be seen as semantic rules that are used to complete the knowledge given by the raw facts determined by the data in the sources. In this sense, the source data of an OBDI system can be seen as an incomplete database, and query answering can be seen as the process of computing the answers logically deriving from the combination of such incomplete knowledge and the ontology axioms. Therefore, at least conceptually, there is a connection between OBDM and the two areas of incomplete information and deductive databases (Ceri et al. 1990).
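The role of axioms in completing incomplete source data can be made concrete with a naive forward-chaining sketch. The axioms, the individuals and the function names below are assumptions made up for this illustration, and the sketch handles only simple inclusions between concepts; it is not claimed to be how an OBDM system actually computes answers.

```python
# Hypothetical axioms: each pair (A, B) reads "every A is also a B".
axioms = {("Professor", "Researcher"), ("Researcher", "Agent")}

# Raw facts retrieved from the sources through the mapping (incomplete knowledge).
source_facts = {("Professor", "ada"), ("Researcher", "bob")}

def saturate(facts, inclusions):
    """Add every fact that logically follows from the inclusion axioms."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for concept, individual in list(derived):
            for sub, sup in inclusions:
                if sub == concept and (sup, individual) not in derived:
                    derived.add((sup, individual))
                    changed = True
    return derived

def answers(concept):
    """Individuals that are instances of the concept once axioms are applied."""
    return sorted(i for c, i in saturate(source_facts, axioms) if c == concept)
```

Here "ada" is returned as a Researcher although no source states it directly: the fact is inferred from the first axiom.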

2. FUNDAMENTAL MODULES OF AN ONTOLOGY FOR EMES

In this section we describe a first draft of an ontology for EMES, which we call Sapientia (here, we refer to the beta version of Sapientia 1.0). This first draft should be seen as a first step towards a complete ontology for research assessment, and is constituted by four modules, each one centered around one of the following concepts: Agent, Research, Publishing, and Space.

We observe that Sapientia is under development at the "Dipartimento di Ingegneria Informatica Automatica e Gestionale" of Sapienza Università di Roma. What we illustrate here is a first, incomplete draft whose aim is to provide some evidence that an ontology for research assessment can be built, and to show some of the benefits that such an ontology can provide. The draft is incomplete for two main reasons:

not all the relevant concepts are included in the modules that we present, and not all relevant characteristics of the concepts are included; this is due both to the lack of space, and to the fact that our ontology is not yet complete;

the properties of the concepts that we model here are only static properties. In the full version of the ontology, also dynamic properties (those varying in time) are modelled, by introducing suitable time-dependent relations associated to the concepts.

The modules are specified in OWL 2, the standard logic-based ontology language promoted by the W3C. In order to provide an intuitive account of the various modules, we describe them in terms of a diagrammatic representation.

The language for expressing the ontology is Graphol, designed at the "Dipartimento di Ingegneria Informatica Automatica e Gestionale" of Sapienza Università di Roma. A tutorial on this language is available at

http://www.dis.uniroma1.it/~graphol/documentation/GrapholIntro.pdf.

The modules expressed in Graphol are available on request.

In what follows, we devote one subsection to each of the above-mentioned modules. After these subsections, we also present a view of the ontology that is particularly relevant for the remaining sections of this document. The view collects various concepts that are present in the four modules, and presents them in a single diagram, with the goal of highlighting the relevant relations. The main concept characterizing the view is Affiliation, and for this reason we simply call the view "Author Affiliation View".

2.1 The module Agent

This module concerns the agents that are relevant in our domain. An agent is any subject operating in the research world, and this module aims at describing and classifying the various types of agents. More precisely, an agent is a role that a subject assumes when carrying on activities related to research, where a subject is either a natural person or an organization.

Thus, there are four basic concepts to deal with in this module, namely, Subject, Agent, Natural person, and Organization.

SUBJECT

A subject is any entity in the research domain which can act as an agent and, playing such a role, performs relevant activities in the research context. As we said, there are two types of subjects: natural persons (see below), and organizations (see below).


AGENT

Any subject during his/her/its life can embody one or more agents, each of them operating during a given period of time. Thus, an agent is a role a subject assumes when carrying on relevant activities in the research domain. Notice that a subject can embody more than one agent at the same time, but there can be periods when a subject exists without embodying any agent.

The ontology considers the following types of agent. Each of them is described in detail in a specific module of the overall ontology.

Table 3. Ontology of agents

Types of agent | Description | Sapientia module

Author | a subject which has written some content to be published as a publication (for instance reporting the results of a research he/she has carried out) | Publishing

Caretaker | a subject which takes care of things valuable in the research world | Caretaker

Degrees conferrer | a subject which grants degrees allowing students to qualify themselves | Degrees conferrer

Editor | a subject which oversees and coordinates a publication where contributions of different authors need to be verified, harmonized and combined | Publishing

Examiner | a subject which uses his/her/its competence to assess something | Examiner

Producer | a subject which produces economic value | Producer

Publisher | a subject which provides some media to deliver and display publications | Publishing

Researcher | a subject which attempts to advance the state of the art of knowledge | Researcher

Student | a natural person who acquires knowledge to improve his/her educational qualification | Teacher

Supporter | a subject which assigns or distributes funds to other agents | Supporter

Teacher | a natural person who teaches some students | Teacher

Source: Sapientia 1.1 DIAG Sapienza University of Rome.

Note that any subject which assumes the role of student or the role of teacher must be a natural person. An agent of specific types (e.g., researchers, teachers, etc.) can be affiliated to one or more organizations, where any affiliation is an instance of the concept Affiliation and has specific properties (its duration, for example). Affiliation is a time-dependent concept.
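The distinction between a subject and the agents it embodies can be sketched as a small data structure. The class layout, the year-based periods and the function names are assumptions introduced only for this illustration; they are not taken from the Sapientia specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Agent:
    kind: str                        # e.g. "Researcher", "Teacher"
    start_year: int
    end_year: Optional[int] = None   # None means the role is still active

@dataclass
class Subject:
    name: str
    is_natural_person: bool          # False stands for an organization
    agents: List[Agent] = field(default_factory=list)

# Roles that, according to the text, only a natural person can assume.
PERSON_ONLY = {"Student", "Teacher"}

def embody(subject, agent):
    """Attach a role to a subject, enforcing the natural-person constraint."""
    if agent.kind in PERSON_ONLY and not subject.is_natural_person:
        raise ValueError(f"{agent.kind} must be embodied by a natural person")
    subject.agents.append(agent)

def active_agents(subject, year):
    """Roles the subject embodies in a given year (possibly none, possibly several)."""
    return [a.kind for a in subject.agents
            if a.start_year <= year and (a.end_year is None or year <= a.end_year)]
```

The same subject can thus hold several roles at once, or none at all in a given year, which mirrors the description of AGENT above.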

NATURAL PERSON

A natural person is a specific type of subject, who can assume many roles in the research domain, thus embodying a number of agents (the most common are: author, editor, examiner, researcher, student and teacher). Furthermore, a natural person can act in affiliations with one or more organizations, in particular when some agents she embodies are involved in that affiliation.

Any natural person has, at any moment of life, a socio-cultural-stage, which is the formalized level of her ability to contribute to the research context. The socio-cultural-stage has two components: the educational qualification, and the career position.

An educational qualification is a degree that a natural person has achieved. Similarly, a career position is a career grade that a natural person has reached (or that she is habilitated to reach). Notice that Educational_qualification and Career_position are two concepts, each representing a relationship in which a natural person is involved (the former with a degree, the latter with a career grade). The more socio-cultural-stages a person has reached, the better she can contribute to research (and the more significant her responsibility). The sequence of career positions of a person is her career.


Considering the possible socio-cultural-stages, these are defined by two concepts: Degree and Career grade. Any instance of these concepts must have a national recognition in order to be accepted in a given nation. A Recognition is an act a Nation performs when formalizing the validity of a socio-cultural-grade. Therefore, we have degree recognitions, granting the validity of a degree, and grade recognitions, granting the validity of a career grade. Note that a Grade_recognition can have a Degree_recognition as a prerequisite, and recognitions can be related to one another (Inter_recognition). There are two kinds of inter-recognition:

Eq_recognition. Any of the required degrees is considered equivalent to the recognized degree: having one of them is the same as having the recognized degree.

Access_recognition. None of the required degrees is equivalent to the recognized degree, which still has to be achieved; having one of them, however, makes the achievement possible.
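The difference between the two kinds of inter-recognition can be shown with a minimal sketch. The degree names and the two lookup tables are hypothetical examples, not data from the ontology.

```python
# Hypothetical recognition data: recognized degree -> degrees required by the recognition.
eq_recognition = {"Laurea Magistrale": {"Master of Science"}}
access_recognition = {"PhD": {"Laurea Magistrale", "Master of Science"}}

def counts_as(held_degree, recognized_degree):
    """Eq_recognition: an equivalent degree is as good as the recognized one."""
    return (held_degree == recognized_degree
            or held_degree in eq_recognition.get(recognized_degree, set()))

def can_pursue(held_degree, recognized_degree):
    """Access_recognition: the held degree only makes the achievement possible."""
    return held_degree in access_recognition.get(recognized_degree, set())
```

In this sketch a holder of the hypothetical "Master of Science" counts as holding a "Laurea Magistrale", but can only pursue, not claim, a "PhD".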

ORGANIZATION

An organization is a type of subject that can assume many roles in the research domain, embodying a number of agents (the most common are: degrees_conferrer, researcher, publisher, supporter, producer and caretaker).

An organization may be a legal entity, a subordinated organization, or both. A subordinated organization is under the control of its parent organization, which delegates to the subordinated organization part of its goals. Similarly, an affiliation is under the control of its parent organization, which delegates to the affiliated natural person part of its goals.

Among the legal entities, of particular concern are public administrations, enterprises, and technical and research institutions. Notice that universities are technical and research institutions that play the role of degrees conferrer, and public research institutes are both public administrations and technical and research institutions.

Table 4. Short definitions

Term | Type | Definition

Affiliation | Concept | the condition of an agent when affiliated to an organization

has_degree_recognition | Relation | links a degree to its recognitions

has_parent | Relation | links every affiliation or subordinated organization to its parent organization

acts_in | Relation | links any person to her affiliations

has_recognizing_nation | Relation | links a recognition to the nations which grant it

Legal_entity | Concept | an organization which the law allows to act as if it were a single person for certain purposes

Organization | Concept | any social unit of people, structured and managed to meet a need or to pursue collective goals. Organizations are open systems: they affect and are affected by their environment.

Recognition | Concept | any grant a nation gives to a Degree or a Career grade

Subject | Concept | any entity which can act as an agent and, playing such a role, can perform some activities relevant in the research world

Source: Sapientia 1.1 DIAG Sapienza University of Rome.

2.2 The module Research

The research module aims at modelling the research activities of researchers and their products. The central concept of the module is Research activity, which is linked to three crucial concepts representing the research products: Research output, Research outcome, Technology transfer.

RESEARCH OUTPUTS AND OUTCOMES


Any research activity has:

its direct output (has_output), available without the contributions of any other activities,

its outcome (has_outcome): any output of any activity (not necessarily a research activity) participating in a value chain where the research activity has an enabling role (i.e., without the research activity that output would not be generated).
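One possible reading of the output/outcome distinction is sketched below as reachability along a value chain. The activities, the products and the auxiliary relation requires (an activity needs a product as input) are hypothetical and introduced only for this illustration; the ontology itself only names has_output and has_outcome.

```python
# Hypothetical value chain. has_output: activity -> products it generates directly.
has_output = {
    "research": ["method"],
    "engineering": ["prototype"],
    "manufacturing": ["product"],
}
# Assumed helper relation: activity -> products it cannot do without.
requires = {"engineering": ["method"], "manufacturing": ["prototype"]}

def outcomes(activity):
    """Outputs of downstream activities that the given activity enables."""
    found = []
    frontier = list(has_output.get(activity, []))
    seen = set(frontier)
    while frontier:
        item = frontier.pop()
        for other, inputs in requires.items():
            if item in inputs:
                for out in has_output.get(other, []):
                    if out not in seen:
                        seen.add(out)
                        found.append(out)
                        frontier.append(out)
    return found
```

Under these assumptions the direct output of "research" is the method, while the prototype and the product are its outcomes, since neither would exist without it.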

The figure shows the chain schemes that justify the outcome of the research activity shown on the left. The arrows represent the relation has_output. There are three particular kinds of research output considered as remarkable cases in the ontology:

method: the specification of means and modes;

abstract artefact: a role which can be played by an object and can be effectively recognized

Figure 1: The model Research. Source: Sapientia 1.1 DIAG Sapienza University of Rome.

The following table shows a list of research products considered in many papers about research assessment, giving their correspondence with an ontological specialization of Research_output and Research_outcome.

analyse Unreviewed_document Unreviewed_document

archive Software_artefact Software_artefact

artefact Physical_artefact Physical_artefact

assessment material Unreviewed_document Unreviewed_document

book chapter Book-like_content Chapter

books Book-like_content Monograph

building Physical_artefact Physical_artefact

case note Unreviewed_document Unreviewed_document


catalogue Software_artefact Software_artefact

chapter Book-like_content Chapter

code Unreviewed_document Unreviewed_document

Composition Event_class Event_class

conference paper Paper-like_content Paper

Confidential_report Report Report

Conservation - -

Creative writing Physical_artefact Physical_artefact

critical review article Survey Paper

data set Software_artefact Software_artefact

Database Software_artefact Software_artefact

design Unreviewed_document Unreviewed_document

design code Unreviewed_document Unreviewed_document

device Product Product

dictionary Dictionary Dictionary

digital broadcast media Media_content Media_content

electronic publication Paper-like_content Paper

electronic resource Software_artefact Software_artefact

evidence synthesis Unreviewed_document Unreviewed_document

exhibition - -

film Film Film

grammar Written_content Written_content

image Image Image

installation Event Event

intellectual property Patent-like_content Patent

Interactive tool Physical_object Physical_object

Journal articles Paper-like_content Paper

map Map Map

material Material Material

meta-analyses Unreviewed_document Unreviewed_document

meta-syntheses Unreviewed_document Unreviewed_document

methodological and theoretical work Content Content

molecule Material Material

monograph Book-like_content Monograph

multi-use data set Software_artefact Software_artefact

multilateral and international agencies research reports Report Report

Museum catalogue Software_artefact Software_artefact

outputs from projects commissioned by all levels of government, industry and other research funding bodies Report Report

paper in conference proceeding


paper in peer-reviewed journal



patent Patent-like_content Patent

patent application Patent-like_content Patent_application

performance - -

policy evaluation/reports commissioned report Report Report

primary data reports Report Report

Process Method Method

Product Product Product

prototype Prototype Prototype

publication of development donors Unreviewed_document Unreviewed_document

published conference paper Paper-like_content Paper

research report Report Report

research reports to government departments, charities, the voluntary sector, professional bodies, industry or commerce Report Report

research-based case studies Descriptive_content Descriptive_content

review articles Survey Paper

service Event_class Event_class

software package Software_artefact Software_artefact

Special issue Paper-like_content Paper

Standard Method Method

systematic review Unreviewed_document Unreviewed_document

technology appraisal Report Report

text books Book-like_content Monograph

textbook Book-like_content Monograph

therapy Method Method

Translation Unreviewed_document Unreviewed_document

video Film Film

Web content Software_artefact Software_artefact

work published in non print media


working paper Unreviewed_document Unreviewed_document

Source: Sapientia 1.1 - DIAG, Sapienza University of Rome.

TECHNOLOGICAL TRANSFERS

A technology transfer is a particular type of research activity which generates, as its output, something more than knowledge (different from descriptive content). The notion of technology transfer allows applied research to be considered as a composition of research activities (not necessarily temporally ordered): one activity producing knowledge and one producing application results (the technology transfer).

2.3 The module Publishing

This module concerns publishing, the activity that allows people to know the results of research. The output of a publishing activity is a publication, which is a way to represent a content through some media. Notice that a publication of a research work is not considered an output of that work (the output of that work is the content of the publication). The central concepts of this module are: Publication, Content, and Reference.

PUBLICATION

A publication aims at reporting (at any level) empirical or theoretical work and describes the results in some knowledge field. There are three kinds of agents involved in a publication:

Author: an author of a publication is an agent which has contributed in writing the content of the publication (for instance reporting the results of a research she has carried out);

Editor: an editor of a complex publication (where contributions of different authors need to be verified, harmonized and combined) is an agent which oversees and coordinates the publication;

Publisher: a publisher of a publication is the agent which provides some media to deliver and display a publication (the ontology considers as a traceable activity only the publisher activity).

There are three kinds of publications that are relevant in the research domain:

Atomic publications: a publication resulting from a unique, indivisible act of writing by one or more authors.

Collections: a publication disseminating a group of atomic publications in a unique impulse, during a limited and short period of time.

Series: a publication disseminating a group of atomic publications during a long and (perhaps) unlimited period of time. Notice that a series can disseminate its atomic publications in a direct way or by disseminating collections.

Figure 2: The model Publishing (Source: Sapientia 1.1 DIAG Sapienza University of Rome.)


The figure above shows how our ontology considers the cycle of production and dissemination of knowledge.

Starting from the research box (right-middle of the figure), people carry out their research activities and produce content to be disseminated.

An author (typically) publishes his/her atomic publication (for example, a paper).

This atomic publication is disseminated: it is transmitted in a collection or in a series.

People read and study it, and use the knowledge represented in it as a starting point for new research.

A patent application is a possible publication, output of an applied research. Notice that a patent is different from the other types of atomic publication: it is a right granted by a state which may concern a research output, not an output itself. A patent application follows its own path within the three levels of publications:

it is an atomic publication itself,

it is published in an issue (a collection),

that issue appears in an Intellectual property law journal (a series).
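The three levels of publication, and the path followed by a patent application, can be sketched with plain data classes. The class and attribute names are assumptions made for this illustration and do not reproduce the Sapientia vocabulary.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AtomicPublication:
    title: str

@dataclass
class Collection:
    name: str
    items: List[AtomicPublication] = field(default_factory=list)

@dataclass
class Series:
    name: str
    collections: List[Collection] = field(default_factory=list)
    direct_items: List[AtomicPublication] = field(default_factory=list)

    def atomic_publications(self):
        """All atomic publications disseminated directly or through collections."""
        pubs = list(self.direct_items)
        for collection in self.collections:
            pubs.extend(collection.items)
        return pubs

# Hypothetical example following the patent application path described above.
application = AtomicPublication("Patent application X")
issue = Collection("Issue 7", [application])
journal = Series("An intellectual property law journal", [issue])
```

The method atomic_publications covers both dissemination modes of a series: directly, or by way of its collections.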

CONTENT

There are three kinds of contents:

Paper-like content (a content structured so as to be published as a paper)

Book-like content (a content structured so as to be published as monographs or edited chapters)

Patent-like content (a content structured so as to be published as a patent application; see below)

Note that there is no constraint between contents and the publications where they can be published (for example, a patent_like_content can be placed in a part of a paper).

REFERENCES

Any new atomic publication provides references to previously published communications which have a bearing on the subject of the new publication. The purpose of the references is to allow readers of the paper to refer to the cited work, assisting them in judging the new work, giving source background information, and acknowledging the contributions of earlier researchers.

Every atomic publication is conceptually divided into two parts: (i) the body of the new content, and (ii) the references to previous publications.

Any reference in the reference part of an article is represented in the ontology as an instance of the concept Reference (in the figure it is represented by a little red circle, whereas the publication is represented by a little blue half-circle).

The participation of a reference R in a (atomic) publication A is represented by the role has_as_reference from A to R.

The (unique) publication B cited by R is represented by the role has_citation_from from B to R.
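The two roles can be read as two binary relations sharing the Reference instances, from which citation links between publications are recovered by a join. The publication and reference identifiers below are made up for this illustration.

```python
from collections import Counter

# Hypothetical instances. has_as_reference: (citing publication A, reference R).
has_as_reference = [("A", "R1"), ("A", "R2"), ("C", "R3")]
# has_citation_from: (cited publication B, reference R); each R cites exactly one B.
has_citation_from = [("B", "R1"), ("D", "R2"), ("B", "R3")]

def citation_counts():
    """Number of references pointing to each cited publication."""
    return Counter(cited for cited, _ in has_citation_from)

def cited_by(publication):
    """Publications whose reference part contains a reference to the given one."""
    refs = {ref for cited, ref in has_citation_from if cited == publication}
    return sorted({citing for citing, ref in has_as_reference if ref in refs})
```

Keeping Reference as a concept of its own, rather than a direct publication-to-publication link, leaves room for attaching properties to each individual reference.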

2.4 The module Space

The module Space aims at modelling the various space regions of interest in the research world. Figure 3 shows a diagrammatic representation of the module. We distinguish between spaces (regions of spaces), and places, where a place represents a usage of a particular space made by people. Correspondingly, the module is organized in two main submodules:

the space submodule

the place submodule

SPACE


The space submodule models essentially two notions: region of space, and point on earth.

Notice that a region of space is not necessarily on the Earth's surface (it can be, for example, under the Earth's surface or on the surface of the Moon).

A region of space may be a part of another region of space, and a point on Earth may be located in a region of space. If a point is included in a region which is part of another region, that point is also included in the latter region. Every point on Earth has its coordinates: longitude, latitude, elevation.
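The inclusion rule (a point located in a region is also included in every region that contains it) amounts to following the part-of relation upwards. The sketch below assumes, for simplicity, that each region is part of at most one other region; the place names are illustrative only.

```python
# Hypothetical data. part_of: region -> the region it is part of (single parent assumed).
part_of = {"Rome": "Lazio", "Lazio": "Italy"}
# located_in: point -> the smallest region it is located in.
located_in = {"point_1": "Rome"}

def regions_of(point):
    """All regions that include the point, from the innermost outwards."""
    regions = []
    region = located_in.get(point)
    while region is not None:
        regions.append(region)
        region = part_of.get(region)
    return regions
```

With several parent regions per region, the same idea would require a graph traversal instead of a single chain.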

PLACE

The conceptual organization of space, in our ontology, is based on the notion of place. A place is a role assigned to a region of space in recognizing an interest about the area itself. While a spatial area exists independently from human intervention, a place is a social object: it exists through a decision which confers to the region a needed role. There are four kinds of places in the ontology:

relevant place: the place is a potential source of investigation in research activities (e.g., the Moon, the Iguazu falls, the ancient Naxos)

territory: the place is a political and economic unit where people live and produce (e.g., Italy, Normandy)

site: the place is the location of something of interest (a research source, an activity or an asset) from the point of view of research (e.g., the site of the Large Hadron Collider of the CERN Laboratory)

In cases 1 and 2 the role is directly attributed to the region of space: whatever is permanently placed within the region inherits the role, because it is in that region. In cases 3 and 4 the role is indirectly attributed to the region of space through the interesting things it includes. So the region of space:

must be large enough to include all the objects characterizing the role;

must be small enough to characterize the place where they all are: there should be points in the region (entrances) that can be considered equivalent, in terms of time and cost, for reaching those objects, the effort to cover the distance between the points and the objects being negligible.

Any object in a Site has, at a given time, its position in the site. The position in a site (accessible through an entrance) characterizes the relation between an object and a site in a given period. A residence is a particular position for organizations (e.g., the residence of the CERN research organization); see the agent module.

While the concept of place is static the concept of position is dynamic. Consider the following example.

Example

The figure shows the following facts:

the asset A1 changes its position from position 1 (in the place P1) to position 2 (in the place P2)

the asset A2 assumes the position 3 (in the place P1) after A1 has moved to position 2, leaving P1 free.


Figure 3: The module Space (Source: Sapientia 1.1 DIAG Sapienza University of Rome)

TERRITORY

A territory is a place where a certain administrative policy is applied. Note that the population and gross domestic product are characteristics of the site and not of the region of space filled by the place. The link between a person who inhabits a place and the place itself is conventional and does not depend, in general, on the actual position of that person at a given time. The same applies to economic activities.

Since a territory is a role of a region of space, it is thought of as including all social objects and social activities that are dependent on the role. For example, a Nation is considered both as the country (a place in political geography) and the state which has sovereignty over the country. Similarly, a cluster system is a territory used as a field for analyzing the economic, social, political and institutional relationships that generate, within the boundary of the cluster itself, a collective learning process in a group of technological or functional areas.

As for European territories, the Nomenclature of Territorial Units for Statistics (NUTS) was drawn up by Eurostat more than 30 years ago in order to provide a single uniform breakdown of territorial units for the production of regional statistics for the European Union. Since 2003 it has been adopted by the European Parliament. A particularly important goal of the Regulation is to minimize the impact of changes in the national administrative structures on the availability of comparable regional statistics.

NUTS is a hierarchical system for dividing up the economic territory of the EU for the purpose of collecting, developing and harmonizing European regional statistics about:

major socio-economic European regions,

basic regions with respect to the application of regional policies,

small regions for specific diagnoses.

Different criteria have been used in subdividing national territories into regions. These are normally divided into normative and analytical criteria:

normative regions are the expression of political will; their limits are fixed according to the tasks allocated to the territorial communities, according to the sizes of population necessary to carry out these tasks efficiently and economically, and according to historical, cultural and other factors;

analytical (or functional) regions are defined according to analytical requirements; they group together zones using geographical criteria (e.g. altitude or type of soil) or using socio-economic criteria (e.g. homogeneity, complementarities, or polarity of regional economies).

For practical reasons related to data availability and the implementation of regional policies, the nomenclature is based primarily on the institutional divisions currently in force in the Member States (normative criteria). In our ontology, the NUTS system is modeled in the following way:

Major, basic and small European regions are all distinct European territories.

All major European regions are identified by their NUTS1 code;

All basic European regions are parts of a major one: the link between any basic region and the major one including it is represented by the role hasNUTS1

All basic European regions are identified, within the major one they are included in, by their NUTS2 code

The NUTS1 code of a given major region is also considered a characteristic (NUTS1ref) of the basic regions included in that major region

All small European regions are parts of a basic one: the link between any small region and the basic one including it is represented by the role hasNUTS2

All small European regions are identified, within the basic one they are included in, by their NUTS3 code

The NUTS2 code of a given basic region and the NUTS1 code of the major region including that basic region are also considered characteristics (NUTS2ref and NUTS1ref, respectively) of the small regions included in the basic region.
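
The modeling rules above can be illustrated with a minimal Python sketch. It is not part of the Sapientia ontology: the class names (MajorRegion, BasicRegion, SmallRegion) and attribute names (has_nuts1, has_nuts2, nuts1_ref, nuts2_ref) are hypothetical and only mirror the roles hasNUTS1 and hasNUTS2 and the derived characteristics NUTS1ref and NUTS2ref. The codes ITI, ITI4 and ITI43 are used purely as sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MajorRegion:
    nuts1: str  # a major region is identified by its NUTS1 code


@dataclass(frozen=True)
class BasicRegion:
    nuts2: str              # identifies the basic region within its major region
    has_nuts1: MajorRegion  # role hasNUTS1: the major region it is part of

    @property
    def nuts1_ref(self) -> str:
        # NUTS1ref: the NUTS1 code of the enclosing major region
        return self.has_nuts1.nuts1


@dataclass(frozen=True)
class SmallRegion:
    nuts3: str              # identifies the small region within its basic region
    has_nuts2: BasicRegion  # role hasNUTS2: the basic region it is part of

    @property
    def nuts2_ref(self) -> str:
        # NUTS2ref: the NUTS2 code of the enclosing basic region
        return self.has_nuts2.nuts2

    @property
    def nuts1_ref(self) -> str:
        # NUTS1ref is inherited through the enclosing basic region
        return self.has_nuts2.nuts1_ref


# Sample instances (illustrative values only)
major = MajorRegion("ITI")
basic = BasicRegion("ITI4", major)
small = SmallRegion("ITI43", basic)
```

In this sketch the reference codes are derived by navigating the part-of links rather than stored twice, so a small region can never carry a NUTS1ref that disagrees with its basic region.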

2.5 The Author Affiliation View

As we said before, we now provide a description of a view over the Sapientia ontology. This view collects those concepts and relations that are relevant for the task of deciding author affiliation as found in atomic publications, and is illustrated in Appendix 3.

As one can easily verify, the Author Affiliation View is constituted by a set of concepts and relations that are present in the modules described above. The new aspect of the view is that such concepts and relations are organized in a simplified way, so as to highlight and gather links that are immediately useful in our subsequent analysis. In particular, the following observations hold:

1. The relation has_affiliation appearing in the view is the specialization of the relation with the same name shown in the Agent module to the case where the Agent is an Author.

2. The relation in_city is a shortcut for the link between an Organization and the City of the Address where its Residence is located, where such link is the chain of the following relations from Organization to City: has_state_of_organization, has_residence, has_position, has_entrance, is_in_city.

3. The relation has_publication_affiliation is a shortcut for the chain of the relations has_author and has_affiliation. Following this relation, one is able, for each paper, to reach the Organizations which the authors of the publication are affiliated to.
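
The notion of a shortcut relation defined as a chain can be made concrete with a small Python sketch. It assumes that each relation is represented as a set of (source, target) pairs; the helper compose and all instance names (pub1, author1, aff1, org1, and so on) are hypothetical and serve only as an illustration.

```python
def compose(*relations):
    """Chain binary relations (sets of pairs) from left to right."""
    result = relations[0]
    for rel in relations[1:]:
        # keep (a, c) whenever some b links a to b and b to c
        result = {(a, c) for (a, b) in result for (b2, c) in rel if b == b2}
    return result


# has_publication_affiliation as the chain has_author, has_affiliation
has_author = {("pub1", "author1"), ("pub1", "author2"), ("pub2", "author2")}
has_affiliation = {("author1", "aff1"), ("author2", "aff2")}
has_publication_affiliation = compose(has_author, has_affiliation)

# in_city as the chain of the five relations from Organization to City
has_state_of_organization = {("org1", "state1")}
has_residence = {("state1", "residence1")}
has_position = {("residence1", "position1")}
has_entrance = {("position1", "entrance1")}
is_in_city = {("entrance1", "Rome")}
in_city = compose(
    has_state_of_organization,
    has_residence,
    has_position,
    has_entrance,
    is_in_city,
)
```

With these toy instances, in_city links org1 directly to Rome, and has_publication_affiliation links each publication to the affiliations of its authors without mentioning the authors themselves.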

Point 3 above is particularly important for the European Map of Excellence. Indeed, as already observed, the main general goal of the present study is to discuss the problem of establishing a comprehensive list of research organizations, disambiguating the ways in which research organizations show up in bibliographic data, and locating the scientific production of research organizations at geographic level.

In terms of the Author Affiliation View of the ontology this means the following.

Establishing a comprehensive list of research organizations means singling out a mechanism allowing us to compile a sound and complete list of Organizations (in particular Public Research Institutions), together with their relevant properties (in particular the City of residence). We will use the term "Authority File UNI" or "Master list for Organization" for such a list. In other words, this aspect amounts to finding out which are the correct instances of the concept Organization, where correct here means faithful to the real world we are modelling.

Disambiguating the ways in which research organizations show up in bibliographic data means:

to single out a mechanism allowing us to compile a sound and complete list of Atomic Publications. In other words, this aspect amounts to finding out the correct instances of the concept Atomic_publication, and

to single out a mechanism allowing us to find the Organizations that are to be associated with the various atomic publications, i.e., that correspond to the affiliations of the authors of atomic publications. In other words, this aspect amounts to finding out the correct instances of the relation has_publication_affiliation, the correct instances of the concept Affiliation, and the correct instances of the relation has_involvment_in.

Indeed, it is the chain of has_publication_affiliation and has_involvment_in that allows us to link every atomic publication to the correct organizations.

Locating the scientific production of research organizations at geographic level means to be able to associate with each organization the geographical region where the organization resides. In other words, this aspect amounts to finding out the correct instances of the relation in_city.

Note that there are concepts and relations of the Author Affiliation View whose instances we will not use in the following, namely:

the concept Author

the relation has_affiliation.

Also, notice that we assume that the set of correct instances is available for the following concepts and relations:

the concept Atomic_publication. Indeed, such instances are provided by the bibliographic databases;

the concept Affiliation_in_atomic_publication. Again, such instances are provided by the bibliographic databases;

the concept City. Such instances can be collected from various reliable sources.

3. TECHNICAL SPECIFICATIONS FOR THE EUROPEAN MAP OF EXCELLENCE (EME)

As we said, the main general goal of the present study is to address three problems: (1) establishing a comprehensive list of research organizations, in particular, Public Research Organizations - PROs, (2) disambiguating the ways in which research organizations (in particular PROs) show up in bibliographic data, and (3) locating the scientific production of research organizations (in particular PROs) at geographic level.

In section 3.1 we address problem 1 with regard to universities. In other words, we illustrate a way to set up a data set that can be used to establish a high-quality mapping to the concept University of the ontology and its relevant properties. Therefore, section 3.1 reports on activity a) of the Technical and Financial Offer, the one regarding the creation of an Authority File Nomenclature of Units for Territorial Statistics (NUTS 2) in which all universities in EU 28 + Switzerland and Norway have a precise geographic location using GIS coordinates.

Section 3.2 addresses problem 2, by illustrating a number of issues arising in the disambiguation of institution affiliations in scientific publications.

Section 3.3 continues the study of problem 2, and illustrates what has been done in the GRBS (Global Research Benchmarking System) for academic institutions. It therefore reports on activity a) of the Technical and Financial Offer, the one concerning the disambiguation and validation of affiliations of academic institutions used in publications indexed in bibliometric databases. This activity has been carried out on the basis of recent projects on the topic (i.e. ETER and EUMIDA) and other research activities carried out at the Sapienza University of Rome based on one bibliometric database.

Sections 3.3 and 3.4 address problem 2 and problem 3, and report on activity c) of the Technical and Financial Offer. This activity concerns the identification of public research institutions in affiliations used in publications indexed in bibliometric databases and their assignment to Nomenclature of Units for Territorial Statistics (NUTS2 and NUTS3) regions, together with the proposal of an approach for identifying PROs from such affiliations and assigning to them the NUTS2 and NUTS3 regions. The approach has been tested on a sample of raw data extracted from a bibliometric database for a large PRO.

3.1. Gathering data about European Universities

Geo-referencing information pertaining to excellence in S&T of European universities at NUTS2 and NUTS3 level was made possible by the availability of the results of two projects promoted by the European Commission:

EUMIDA - Feasibility Study for Creating a European University Data Collection (Contract No. RTD/C/C4/2009/0233402) completed in 2010;

ETER - European Tertiary Education Register (Contract No. EAC2013038) to be completed in July 2015.

By integrating the information provided by the two projects and filling in a few information gaps, it has been possible to create an Authority File of higher education institutions that deliver the PhD degree in the European Union and in EFTA countries (Iceland, Liechtenstein, Norway, Switzerland).

The file lists 1,131 institutions, which represent approximately 50% of all institutions contained in the European tertiary education register (the remaining ones are institutions not awarding doctoral, i.e. ISCED 8, degrees). For each institution the file reports the ETER code, the name in the national language and in English, and the NUTS codes at level 2 and 3. Additional geographical information (name of the city, postcode, GIS coordinates) is available in the ETER database.

In principle, the location of universities within NUTS2 regions is quite straightforward, given that multi-site institutions (see footnote 2) with activities spreading across regional borders are not very widespread. Nevertheless, almost 10% of the HEIs included in the Authority File have secondary branches located in two or more regions (very rarely abroad).

This share increases up to 24% if we look at the NUTS3 level. The share of multi-site institutions is actually increasing around Europe as a consequence of two opposite phenomena: on the one hand, the spread of university activities from the original seat into the surrounding region (decentralisation and wider regional coverage); on the other hand, institutional concentration through the merger of small institutions and the creation of larger and more comprehensive HEIs, which usually maintain at least partially the original seats and locations in different cities and regions.

At this stage it is not possible to disentangle information for multi-site institutions, and all university activities are attributed to the region of the main seat. This could create a slight distortion, especially when data are analysed by disciplinary area. It could happen that all functions in a disciplinary area are located in a secondary campus outside the region (e.g. the medical department of Università Cattolica del Sacro Cuore is actually located in Rome: in the map of excellence its figures will be attributed to the ITC4 region instead of ITI43).

The creation of the Authority File is the result of the following analytical steps:

retrieving the most recent information contained in ETER (the results of the first data collection, referring to year 2011, are publicly available online at eter.joanneum.at/imdas-eter); data about Hungary, Romania and Slovenia are missing;

retrieving the missing data from the EUMIDA DC1 dataset publicly available for Hungary and Romania (with reference to year 2008), and from internal resources for Slovenia (reference year 2011 as for ETER);

revising and updating the NUTS codes to the latest version in order to allow for full interoperability with the Eurostat regional database.

The Authority File is available on request.
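
The three analytical steps listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually used to build the Authority File: the record layout (keys id, name, nuts2, nuts3), the function build_authority_file, the mapping nuts_updates and all sample values are hypothetical placeholders.

```python
def build_authority_file(eter_records, fallback_records, nuts_updates):
    """Merge primary records with fallback records and refresh NUTS codes."""
    # Step 1: start from the most recent (primary) records
    merged = {rec["id"]: dict(rec) for rec in eter_records}

    # Step 2: add fallback records only for institutions that are missing
    for rec in fallback_records:
        merged.setdefault(rec["id"], dict(rec))

    # Step 3: replace outdated NUTS codes with their latest version
    for rec in merged.values():
        rec["nuts2"] = nuts_updates.get(rec["nuts2"], rec["nuts2"])
        rec["nuts3"] = nuts_updates.get(rec["nuts3"], rec["nuts3"])

    return sorted(merged.values(), key=lambda rec: rec["id"])


# Placeholder data: none of these identifiers or codes are real
eter = [
    {"id": "A001", "name": "University A", "nuts2": "XX11", "nuts3": "XX111"},
]
eumida = [
    {"id": "A001", "name": "University A (older record)", "nuts2": "XX11", "nuts3": "XX111"},
    {"id": "B001", "name": "University B", "nuts2": "YY2_OLD", "nuts3": "YY2_OLD1"},
]
nuts_updates = {"YY2_OLD": "YY22", "YY2_OLD1": "YY221"}

authority_file = build_authority_file(eter, eumida, nuts_updates)
```

The sketch gives precedence to the more recent source whenever an institution appears in both, which matches the idea of using the older dataset only to fill gaps.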

3.2 Problems in the disambiguation of institution affiliations in scientific publications

Academic institutions constitute in most countries by far the most important type of research entities. A second important group of (mainly) publicly funded research institutions is labelled as public research organizations (PROs). Public Research Organizations can be divided into four categories or ideal types (OECD Innovation Policy Platform 2011):

Mission Oriented Centres (MOCs), owned by government departments or ministries at a national level (e.g., INSERM in France; CIEMAT in Spain);

Public Research Centres (PRCs), publicly funded overarching research institutions such as CNRS in France, CNR in Italy, Max Planck Gesellschaft in Germany;

Research Technology Organizations (RTOs), often in the public sphere, private but not-for-profit, such as Fraunhofer Gesellschaft in Germany and TNO in the Netherlands;

Independent Research Institutes (IRIs), often at the boundary between the public and the private sector, denoted as centres of excellence, and recently founded.

Several large disambiguation studies of the names of academic institutions of authors of millions of scientific publications have been conducted in the past, based on Thomson Reuters' Web of Science (WoS) and Elsevier's Scopus.

The richest branch of literature on disambiguation in bibliometric studies has dealt with the analysis of the authors of scientific publications (author name disambiguation): see Ferreira et al. (2012), Strotmann and Zhao (2012) and Milojevic (2013) for a summary of the state of the art on this subject. Another relevant branch of bibliometrics has investigated research units and institutions as the main unit of analysis (affiliation disambiguation). For the purpose of this study we focus on the issues related to the disambiguation of affiliations in bibliometric databases.

2 According to the definition developed in ETER, multi-site institutions are defined as institutions with local establishments in NUTS3 region(s) that are different from the main seat, even if the definition leaves a margin of flexibility to national statistical offices to adapt to national contexts.

Although the problems of affiliation disambiguation have been present in the literature for several years (see the reconstruction and the analysis of technical and theoretical issues below), only a few works have addressed these problems and proposed workable solutions. See Box 1 for an overview of the recent literature on this issue.

The disambiguation of institutional affiliations is part of the area of entity resolution, which addresses the general problem of identifying and linking/grouping different manifestations of the same real world object (Schaerf 2015; for an introduction see Talburt 2011).

This activity has encountered a series of problems within the bibliometric application framework. A distinction can be made between technical and theoretical issues. This section provides an overview of the main issues, and sketches the main lines of how these can be solved (see footnote 3). Although almost all studies conducted in the past related to academic institutions, the issues that were encountered relate to the affiliation disambiguation of PROs as well. Additional comments are made that specifically relate to PROs.

3 This section is based on Moed (2015).

Box 1: Overview of the recent literature on affiliation disambiguation

Reference: Jiang et al., 2011
Context: conversion of publications affiliations into semantic web data.
Issues faced: affiliation ambiguity (different authors, who have the same affiliation, often express the affiliation in different ways).
Solution proposed: a clustering method based on normalized compression distance.

Reference: Morillo et al., 2013
Context: standardisation of affiliations and addresses for the assessment of research production.
Issues faced: to standardize or codify addresses, in order to produce bibliometric indicators from bibliographic databases.
Solution proposed: a semi-automatic method which is supposed to work with no previous existence of master lists or tables. The analysis is limited to the sectoral level (tested on a sample of 136,821 documents from the WoS database); further research is envisaged to identify individual organisations.

Reference: Cuxac et al., 2013
Context: disambiguation of the affiliations of authors of scientific papers in bibliographic databases at organisation level (laboratory, institute, university, research center). Application on a sample of French CNRS affiliations.
Issues faced: high variability and heterogeneity of naming in large bibliometric databases.
Solution proposed: two approaches. The first considers that a training dataset is available, and uses a Naive Bayes model. The second assumes that there is no learning resource, and uses a semi-supervised approach, mixing soft-clustering and Bayesian learning.

Reference: S. Huang et al., 2014
Context: to improve existing techniques of institution name disambiguation based on word similarity or editing distance.
Issues faced: high variability and heterogeneity of naming in large bibliometric databases.
Solution proposed: the paper proposes a rule-based algorithm. One-to-many relationships between an institution and the many variant names under which it is referred to in bylines of publications are recognized with the aid of statistical methods and specific rules. The performance of the rule-based institution name disambiguation algorithm is evaluated on large datasets in four fields. A test on metadata provided by the WoS shows that often basic structures, e.g. university, department, street address, city (including ZIP code), can be recognized. The study proposes an algorithm for institution name mapping. No specific attention is devoted to the explosion of the organisational level of institutions.
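
To give an idea of the distance measure mentioned in Box 1 for Jiang et al. (2011), the following Python sketch computes a normalized compression distance (NCD) between two affiliation strings. It covers only the distance, not the clustering method of that paper. The choice of zlib as compressor, the function names and the sample strings are assumptions made for illustration.

```python
import zlib


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical affiliation strings
a = "Sapienza University of Rome, Department of Computer Science"
b = "Univ. of Rome La Sapienza, Dept. Computer Science"
c = "Max Planck Gesellschaft, Munich"

distance_same = ncd(a, a)       # a string compared with itself
distance_variant = ncd(a, b)    # two variants of the same affiliation
distance_other = ncd(a, c)      # two unrelated affiliations
```

On strings this short the fixed overhead of the compressor weighs heavily on the result, so the values should be read as a rough ordering signal rather than as calibrated similarities.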

Main technical and theoretical issues

Technical issues

Researchers do not indicate their institutional affiliations in a uniform manner. Even authors from the same institution or department may use different names. Moreover, naming conventions may change over time. Some institutions impose naming conventions upon their researchers, in order to increase the institution's visibility, also in university rankings. For these institutions the uniformity in the institutional names can be expected to be much larger than for those in which such a policy is lacking (see footnote 4).

4 Another source of variability in affiliation data is the fact that publishers do not use uniform formats for the way in which institutional affiliations are recorded in a publication. For instance, some publishers in

Bibliographic database producers apply data capturing rules that re-format and modify the original affiliation data in the scientific articles they index: what is included in the database is not always identical to what was indicated in the original article. Producers of WoS and Scopus to some extent disambiguate institutional names; also, they insert a type of hierarchical order in the various components of an institutional name string, identifying the so-called main organization, but this process is far from perfect and may contain errors (e.g., Moed 2005). Precise estimates of their occurrence are unavailable.

Additional information sources are needed to comprehensively identify research institutions or organizations in a particular country. Normally research entities appear in many variations. Hence, when analysing the research output of a larger country, long lists of institutional names must be checked by analysts who must know what they are looking for in the data.
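
As an illustration of how long lists of name variants might be pre-grouped before an analyst checks them, the following Python sketch reduces each affiliation string to a crude normalized key and groups strings sharing the same key. It is a first-pass heuristic under stated assumptions, not a disambiguation method proposed in this study: the stopword and abbreviation tables, the function names and the sample names are all hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical, deliberately tiny lookup tables
STOPWORDS = {"of", "the", "di", "de"}
ABBREVIATIONS = {"univ": "university", "dept": "department", "inst": "institute"}


def normalize(name: str) -> str:
    """Reduce an affiliation string to an order-insensitive key."""
    # strip accents, lowercase, and split into alphanumeric tokens
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # expand known abbreviations and drop stopwords
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    tokens = [tok for tok in tokens if tok not in STOPWORDS]
    return " ".join(sorted(set(tokens)))


def group_variants(names):
    """Group raw names by their normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return dict(groups)


names = [
    "Univ. of Texas",
    "University of Texas",
    "The University of Texas",
    "Texas Univ",
    "Max Planck Gesellschaft",
]
groups = group_variants(names)
```

Such a key merges only trivially different spellings. Deciding whether the resulting groups denote the same organization, or separate components of an umbrella organization, still requires background knowledge.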

Theoretical issues

It is not always clear how research institutions should be defined. Should affiliated departments be considered a part of an organization? For instance, are academic or affiliated hospitals in all cases parts of universities? Should the components of a university system or umbrella organization such as University of Texas be aggregated or analysed separately? Background knowledge on the structure of individual institutions or organizations and of national research systems is indispensable. This means that a clear mapping should be

designed between the relevant data sources and the concept O