
The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Date post: 06-May-2015
Description:
Presentation by Dieter Van Uytvanck (CLARIN) from 'The Perfect Swell' workshop on text and data mining, held on 27 September 2013.
Transcript
Page 1: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

The Perfect Swell: Workshop on Text and Data Mining for Data Driven Innovation

The research infrastructure perspective

Dieter Van Uytvanck
Max Planck Institute for Psycholinguistics
[email protected]

TDM workshop, London, 2013-09-27

Page 2: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

CLARIN?

§  Common Language Resources and Technology Infrastructure

§  aims at providing easy and sustainable access for scholars in the humanities and social sciences
    §  to digital language data (in written, spoken, video or multimodal form)
    §  to advanced tools to discover, explore, exploit, annotate, analyse or combine them
    §  independent of where they are located: a shared distributed infrastructure
§  More information: www.clarin.eu

TDM workshop London

2013-09-27

www.clarin.eu

Page 3: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Language resources: rich variety

§  Modality: written, spoken, signed
§  Additional channels: eye movements, gestures, neuro-imaging data (EEG, fMRI, …), etc.

[Figure labels: “Annotations”; “Data: the basis for research”]

Page 4: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Language resources: rich variety

§  Location:
    §  data from all over the world (including some very remote corners)
    §  … and from the world wide web, smartphones, …
§  Time:
    §  old historic collections (hieroglyphs, manuscripts, rock carvings, …), often OCR’ed, digitised and annotated
    §  up to real-time data gathered from social networks
§  Origin:
    §  elicited (experiments)
    §  natural language use (“in the wild”)


Page 5: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Data mining in CLARIN

§  very important paradigm in language resource processing
    §  major shift from rule-based to data-driven systems
§  not only text, also multimedia
§  importance of
    §  access to primary data for fellow researchers: TDM needs access to whole works, not only to snippets and sentences
    §  replicating experiments, which is utterly important
    §  technical support: virtual collections make it possible to refer to large online data sets
    §  a safe legal setting for researchers (license signing does not scale to 500,000 texts that are automatically collected from thousands of websites)


Page 6: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Data mining in CLARIN

§  some examples to demonstrate the variation and nature of data mining based on language resources


Page 7: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Some examples (1)

§  Mass text analysis (Petersen et al., 2012): doi:10.1038/srep00313

Page 8: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Some examples (2)

§  AUVIS face/hand tracking analysis: http://tla.mpi.nl/projects_info/auvis/

Head/Hands Tracking

Page 9: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Some examples (3)

§  Stylometry and plagiarism detection: http://www.clips.ua.ac.be/category/projects/stylometry

§  e.g. Mike Kestemont, http://www.mike-kestemont.org/?p=362

Page 10: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Some examples (4)

§  Language evolution analysis with phylogenetic trees (Bouckaert et al., 2012) – doi:10.1126/science.1219669

As the earliest representatives of the main Indo-European lineages, our 20 ancient languages might provide more reliable location information. Conversely, the position of the ancient languages in the tree, particularly the three Anatolian varieties, might have unduly biased our results in favor of an Anatolian origin. We investigated both possibilities by repeating the above analyses separately on only the ancient languages and only the contemporary languages (which excludes Anatolian). Consistent with the analysis of the full data set, both analyses still supported an Anatolian origin (Table 1).

The RRW approach avoids internal node assignments over water, but it does assume, along the unknown tree branches, the same underlying migration rate across water as across land. To investigate the robustness of our results to heterogeneity in rates of spatial diffusion, we developed a second inference procedure that allows migration rates to vary over land and water (15). This landscape-based model allows for the inclusion of a more complex diffusion process in which rates of migration are a function of geography. We examined the effect of varying relative rate parameters to represent a range of different migration patterns (15). Figure 1B shows the inferred Indo-European homeland under a model in which migration from land into water is less likely than from land to land by a factor of 100. At the other extreme, we fit a “sailor” model with no reluctance to move into water and rapid movement across water. Consistent with the findings based on the RRW model, each of the landscape-based models supports the Anatolian farming theory of Indo-European origin (Table 1).

Our results strongly support an Anatolian homeland for the Indo-European language family. The inferred location (Fig. 1) and timing [95% highest posterior density (HPD) interval, 7116 to 10,410 years ago] of Indo-European origin is congruent with the proposal that the family began to diverge with the spread of agriculture from …

Fig. 2. Map and maximum clade credibility tree showing the diversification of the major Indo-European subfamilies. The tree shows the timing of the emergence of the major branches and their subsequent diversification. The inferred location at the root of each subfamily is shown on the map, colored to match the corresponding branches on the tree. Albanian, Armenian, and Greek subfamilies are shown separately for clarity (inset). Contours represent the 95% (largest), 75%, and 50% HPD regions, based on kernel density estimates (15).

Table 1. Bayes factors comparing support for the Anatolian and steppe hypotheses. We estimated Bayes factors directly, using expectations of a root model indicator function taken over the MCMC samples drawn from the posterior and prior of each hypothesis. Bayes factors greater than 1 favor an Anatolian origin. A Bayes factor of 5 to 20 is taken as substantial support, greater than 20 as strong support, and greater than 100 as decisive (30).

Phylogeographic analysis                                 Bayes factor
                                                         Anatolian vs. steppe I   Anatolian vs. steppe II
RRW: All languages                                       175.0                    159.3
RRW: Ancient languages only                              1404.2                   1582.6
RRW: Contemporary languages only                         12.0                     11.4
Landscape aware: Diffusion                               298.2                    141.9
Landscape aware: Migration from land into water less
  likely than from land to land by a factor of 10        197.7                    92.3
Landscape aware: Migration from land into water less
  likely than from land to land by a factor of 100       337.3                    161.0
Landscape aware: Sailor                                  236.0                    111.7
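
The interpretation scale quoted in the Table 1 caption can be applied mechanically to the listed values. The following is a minimal Python sketch, not code from the paper: the function name `interpret_bayes_factor` and the wording of the two labels for values below 5 are assumptions made for illustration, and the numbers are copied from Table 1.

```python
# Thresholds follow the Table 1 caption: 5 to 20 substantial, > 20 strong,
# > 100 decisive. The labels for values below 5 are this sketch's own
# wording (assumption), not terminology from the paper.
def interpret_bayes_factor(bf: float) -> str:
    if bf > 100:
        return "decisive"
    if bf > 20:
        return "strong"
    if bf >= 5:
        return "substantial"
    if bf > 1:
        return "favours Anatolian, below substantial"
    return "does not favour Anatolian"


# (Anatolian vs. steppe I, Anatolian vs. steppe II), as read from Table 1.
TABLE_1 = {
    "RRW: All languages": (175.0, 159.3),
    "RRW: Ancient languages only": (1404.2, 1582.6),
    "RRW: Contemporary languages only": (12.0, 11.4),
    "Landscape aware: Diffusion": (298.2, 141.9),
    "Landscape aware: land-to-water factor 10": (197.7, 92.3),
    "Landscape aware: land-to-water factor 100": (337.3, 161.0),
    "Landscape aware: Sailor": (236.0, 111.7),
}

if __name__ == "__main__":
    for analysis, factors in TABLE_1.items():
        labels = [interpret_bayes_factor(bf) for bf in factors]
        print(f"{analysis}: {labels[0]} / {labels[1]}")
```

On this scale, only the analysis restricted to contemporary languages falls into the "substantial" band; every other row is at least "strong".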

Source: Science, Vol. 337, 24 August 2012, p. 959 (www.sciencemag.org)

Page 11: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

The research infrastructure role

§  Data sets:
    §  Long-term preservation (archiving)
    §  Making them citable (persistent identifiers) and findable (metadata)
    §  Making access easier with federated login
§  Lowering the threshold to use advanced software
    §  offer web front-ends, web service chains
    §  cooperation with computing centres for heavy tasks
§  Know-how building & support
    §  about the nature of the resources and tools
    §  technical matters
    §  legal issues


Page 12: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Legal perspective on resources

§  Rough classification of language resources available via the CLARIN centres:
    §  Public
        §  full access, no restrictions at all
        §  e.g. parallel corpora from the EU Parliament
    §  Academic
        §  available for all academic users
        §  e.g. the Spoken Dutch Corpus (radio recordings, …)
    §  Restricted
        §  everything more restricted than Academic: personalised access rules
        §  e.g. video from doctor-patient interaction

Common Language Resources and Technology Infrastructure, CLARIN-2008-4, p. 27

The first two classification categories PUB and ACA are more demanding for the licensing agreements. In RES basically any kind of licensing language is accepted and it is up to the user to fulfil the requirement. However, the goal of CLARIN is to have as much of its content in the first two categories as possible, and therefore the use of the upgrade agreements is strongly advised.

Some of the resources may be protected because the identity of research participants is disclosed or there is a risk that the identity of research participants could be disclosed. Each Content Provider conducts a disclosure risk review of his resources in order to determine whether any data items could be used to identify individual respondents (Personal Data, see case 3 in Part IV below). The review is also used for assessing the option to offer the material for the CLARIN use, and for informing the Service Providers about resources including Personal Data and any restrictions on such data items. Personal Data is normally disclosed to an End-User only with the permission of the Content Provider.

The summary of the classification task for the Content Owner or the Content Provider is shown in Figure 5 above. The CLARIN prototype specifications for PUB, ACA and RES are shown in the following picture. Examples of each process are found in conjunction with the case studies 1, 2 and 3 in Part IV below.

Figure 6 Three main content categories PUB, ACA and RES and the three sub-categories indicating the additional requirements Inf, NC and ReD.

[Figure labels, "Resource Categories" (IPR Issues, 2.12.2010): CC0, GPL, LGPL; ~CC-by, CC-nc, ~CC-sa, e.g. CC-nd; Special conditions need to be accepted; Other]

3.3 The prerequisites for access rights

Authentication is a prerequisite but authorization can be automatic, e.g. the applicant belongs to a certain group and signs an agreement “I agree” to the conditions related to the resource in question.


Page 13: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Legal perspective on resources

§  CLARIN recommends CC licenses for new resources as this is the least problematic for all in the long run. Such resources can be made publicly available.
§  For older material, we try to distribute it as freely as can be negotiated. For this material we offer two categories:
    §  resources free for researchers
    §  resources requiring individual permission by the owner.
§  It is good to note that not everything is about copyright.
    §  We also have to deal with personal data, which can only be provided for a limited time to individual researchers unless the data are anonymized.
    §  Ethical perspectives should also be taken into account (e.g. asking participants at the time of recording if they are ok with data mining/processing).


Page 14: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Technical Perspective (1)

§  The above restrictions can be realized by requiring:
    §  PUB - no identification of the user and no individual permission, i.e. the resources are free for all and publicly available.
    §  ACA - identification of the user, but no individual permission, e.g. CLARIN-distributed resources for academic use.
    §  RES - identification of the user and individual usage permission, i.e. the resources are available only to individual researchers, e.g. resources containing personal data.
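
The three rules above amount to a small decision procedure. A minimal Python sketch follows, under stated assumptions: the `User` fields, the `may_access` function and the resource identifiers are hypothetical names chosen for illustration and do not describe an actual CLARIN component. Following the earlier classification slide, ACA is read here as "identified and academic".

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    PUB = "PUB"  # public: no identification, no individual permission
    ACA = "ACA"  # academic: identification, no individual permission
    RES = "RES"  # restricted: identification and individual permission


@dataclass(frozen=True)
class User:
    # All fields are hypothetical illustrations of what a login could yield.
    identified: bool = False
    academic: bool = False
    permitted_resources: frozenset = frozenset()  # individually granted RES ids


def may_access(user: User, category: Category, resource_id: str) -> bool:
    """Apply the PUB / ACA / RES rules from the slide to one access request."""
    if category is Category.PUB:
        return True  # free for all, even anonymous users
    if not user.identified:
        return False  # ACA and RES both require identification
    if category is Category.ACA:
        return user.academic  # no per-resource permission needed
    # RES: the owner must have granted this user this specific resource.
    return resource_id in user.permitted_resources


if __name__ == "__main__":
    anonymous = User()
    researcher = User(identified=True, academic=True,
                      permitted_resources=frozenset({"doctor-patient-video"}))
    print(may_access(anonymous, Category.PUB, "europarl"))
    print(may_access(anonymous, Category.ACA, "spoken-dutch"))
    print(may_access(researcher, Category.RES, "doctor-patient-video"))
```

Keeping the category on the resource and the identification state on the user means the same check can serve all three classes without per-category code paths in the caller.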


Page 15: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Technical Perspective (2)

§  Federated Identity Management (“Shibboleth”)
    §  allows access to resources at a remote server
    §  with institutional credentials
    §  makes it relatively straightforward to recognize academic users and grant them access to restricted resources
    §  details: http://clarin.eu/node/3788
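
How an academic user might be recognized from a federated login can be sketched in Python. The slides do not say which attributes are used. This sketch assumes the identity provider releases the eduPerson attribute `eduPersonScopedAffiliation` (values of the form `affiliation@scope`), and the set of affiliations treated as academic is an illustrative policy choice, not CLARIN's. The function name `is_academic` is hypothetical.

```python
from typing import Mapping, Sequence

# Assumed policy: which eduPerson affiliation values count as "academic".
# A real service provider would define its own list.
ACADEMIC_AFFILIATIONS = frozenset({"faculty", "student", "staff", "employee"})


def is_academic(attributes: Mapping[str, Sequence[str]]) -> bool:
    """Return True if any released scoped affiliation matches the policy.

    `attributes` stands for the attribute set received after a federated
    login; the key name is an assumption of this sketch.
    """
    for scoped in attributes.get("eduPersonScopedAffiliation", []):
        affiliation, _, scope = scoped.partition("@")
        # Require a non-empty scope so that bare values are not accepted.
        if scope and affiliation.lower() in ACADEMIC_AFFILIATIONS:
            return True
    return False


if __name__ == "__main__":
    print(is_academic({"eduPersonScopedAffiliation": ["staff@example.org"]}))
    print(is_academic({"eduPersonScopedAffiliation": ["alum@example.org"]}))
```

The point of the sketch is that the resource server never sees a password: it only evaluates attributes asserted by the user's home institution.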


Page 16: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Future perspective for legal exception framework

§  As we in CLARIN are capable of
    §  identifying researchers and
    §  protecting the resources from other users,
§  CLARIN already has all the technical prerequisites needed for implementing and supervising a broad research exception in the EU such as the one already in effect in the Netherlands.


Page 17: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Conclusion

§  Data mining plays an increasingly important role in (language resource-based) research

§  Research infrastructures try to help academics make efficient use of the existing resources and tools

§  Many technical issues have been addressed already (e.g. authentication of researchers)

§  We hope the remaining legal (copyright) issues can be addressed by a research exception (or, similarly, a concept of fair use)


Page 18: The research infrastructure perspective, Dieter Van Uytvanck, CLARIN

Acknowledgement

§  Thanks to Krister Lindén and Erik Ketzan from the CLARIN legal issues committee for their valuable input!

§  Thank you for your attention!


