
OpenAIRE-COAR conference 2014: Argo - a platform for interoperable and customisable text analytics,...

Date post: 26-Jan-2015
Upload: openaire
Description:
Presentation at the OpenAIRE-COAR Conference: "Open Access Movement to Reality: Putting the Pieces Together", Athens - May 21-22, 2014. Argo: a platform for interoperable and customisable text analytics, by Sophia Ananiadou - School of Computer Science, Director, National Centre for Text Mining, University of Manchester
Transcript
Page 1: OpenAIRE-COAR conference 2014: Argo - a platform for interoperable and customisable text analytics, by Sophia Ananiadou - University of Manchester

Argo: a platform for interoperable and customisable text mining

Sophia Ananiadou National Centre for Text Mining

School of Computer Science

The University of Manchester

Page 2

Overview

• Sharing tools, resources and text mining workflows

• Challenges

• Interoperable infrastructure for processing and annotation

2 Open AIRE-COAR Conference Ananiadou

Page 3

NaCTeM

• 1st publicly funded national text mining centre

• Location: Manchester Institute of Biotechnology

• Phase I - Biology (2004-2008)

• Phase II - Biology, Medicine, Social Sciences (2008-2011)

• Phase III – Biology, Medicine, Humanities, Social Sciences; Fully sustainable centre (2011-)

www.nactem.ac.uk

Page 4

Challenges

Language Technology

Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Urdu, Japanese, Korean, …

Tasks: Translation, Information Extraction, Semantic Search, Question Answering, Sentiment Analysis, Summarization, Knowledge Discovery, …

Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …

Text Types: Newswire; Scientific Literature; Full papers/abstracts; Twitter; Patents; Clinical records, EMR; Textbooks, monographs; Online forums; …

Technology: Sentence Splitter, Paragraph Splitter, NP Chunkers, C-parser, D-parser, Semantic parser, NE recognizers, Relation recognizers, …

Diversity of Languages

Diversity of Contexts

Diversity of Applications

TM Workflows

TM Modules

Shared!

Page 5

Metadata

Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Urdu, Japanese, Korean, …

Tasks: Translation, Information Extraction, Semantic Search, Question Answering, Sentiment Analysis, Summarization, Knowledge Discovery, …

Language Technology

Linguistic Resources Knowledge Resources

Resource-Rich

Big Data Big Text

Cloud Computing Crowd Sourcing

Big Ontology

Text Types: Newswire; Scientific Literature; Full papers/abstracts; Twitter; Patents; Clinical records, EMR; Textbooks, monographs; Online forums; …

Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …


OPEN SCIENCE

Page 6

Requirements from TM infrastructure

• Modularity of TM modules

• Interoperability among TM modules and resources

• Generic across different languages, domains, and text types

– Adaptability

Page 7

Interoperability and Adaptability

Modules: POS Tagger, Named Entity, Dependency Parser

Resources: Dictionaries, Ontologies, (Annotated) Text

Adaptation: Rule Writing

Languages: English, French, German, Japanese, Greek

Text Types, Domains

Interoperability and Adaptability in Resource-rich TM INFRASTRUCTURES!

Page 8

Example: extracting proteins, annotations

Corpora: GENIA, PennBioIE, AIMed, GENETAG

Incompatibility: type definitions, texts

Problem: Inconsistency

Page 9

The problem with incompatibility

• Difficult to evaluate NERs

Which NER is best for my task?

NER A and NER B, evaluated on Corpus C and Corpus D:

A: 93%, B: 36%. A is better than B.

A: 63%, B: 90%. B is better than A.

Why are the results so different across corpora and NERs?

Page 10

Text mining workflows

• A pipeline that executes particular tools and resources in order

• Example: semantic search

• Various versions (language- or domain-specific) of basic components needed for different applications and tasks

• Different workflows can be created, compared and evaluated by the ability to seamlessly “mix and match” various versions of components

Components shown: PoS Tagger, Dictionary Lookup, NE Extraction, Chunking, Parsing, Semantic Query
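The "mix and match" idea on this slide can be sketched in a few lines of Python. This is only an illustration, not Argo or U-Compare code: the `Document` class, the two tokenizers, the `dictionary_lookup` component and its tiny lexicon are all hypothetical stand-ins for real components.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Text plus named annotation layers that components add to (hypothetical)."""
    text: str
    layers: Dict[str, list] = field(default_factory=dict)


# Every component has the same signature, so versions are interchangeable.
Component = Callable[[Document], Document]


def whitespace_tokenizer(doc: Document) -> Document:
    doc.layers["tokens"] = doc.text.split()
    return doc


def punctuation_tokenizer(doc: Document) -> Document:
    # Splits punctuation off as separate tokens.
    doc.layers["tokens"] = re.findall(r"\w+|[^\w\s]", doc.text)
    return doc


def dictionary_lookup(doc: Document) -> Document:
    lexicon = {"insulin", "p53"}  # toy lexicon, an assumption for this sketch
    doc.layers["entities"] = [t for t in doc.layers["tokens"] if t.lower() in lexicon]
    return doc


def run_workflow(components: List[Component], text: str) -> Document:
    """Execute the components in order on a fresh document."""
    doc = Document(text)
    for component in components:
        doc = component(doc)
    return doc


text = "Insulin regulates p53."
doc_a = run_workflow([whitespace_tokenizer, dictionary_lookup], text)
doc_b = run_workflow([punctuation_tokenizer, dictionary_lookup], text)
print(doc_a.layers["entities"])  # ['Insulin']
print(doc_b.layers["entities"])  # ['Insulin', 'p53']
```

Swapping a single upstream component changes what the downstream lookup finds, which is why being able to compare alternative workflows side by side matters.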

Page 11

Text mining workflows

Interoperability

Common Data Representation and Types

Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011). U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.

Page 12

Common Type System

• A common type system is required for complete interoperability

• Solution: Maintain local type systems and bridge them via a sharable type system

A single common type system is almost impossible to impose on all developers.

U-Compare

Local Type System A <- bridging -> Sharable Type System <- bridging -> Local Type System B
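A minimal sketch of the bridging idea, assuming each tool keeps its own type names and a per-tool mapping translates them into a sharable type. The type names (`a.Protein`, `b.PROT`, `shared.Protein`) and the `Annotation` class are invented for illustration and are not taken from the U-Compare type system.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    begin: int
    end: int
    type_name: str


# Hypothetical bridges from two local type systems to one sharable type system.
BRIDGE_A: Dict[str, str] = {"a.Protein": "shared.Protein", "a.DNA": "shared.Gene"}
BRIDGE_B: Dict[str, str] = {"b.PROT": "shared.Protein"}


def to_shared(annotations: List[Annotation], bridge: Dict[str, str]) -> List[Annotation]:
    """Re-label annotations with sharable types; drop types that have no bridge."""
    return [
        Annotation(a.begin, a.end, bridge[a.type_name])
        for a in annotations
        if a.type_name in bridge
    ]


# Two tools mark the same span using their own local type names.
from_tool_a = [Annotation(0, 3, "a.Protein")]
from_tool_b = [Annotation(0, 3, "b.PROT"), Annotation(10, 14, "b.Unmapped")]

shared_a = to_shared(from_tool_a, BRIDGE_A)
shared_b = to_shared(from_tool_b, BRIDGE_B)
print(shared_a == shared_b)  # True: the outputs are now directly comparable
```

Each developer maintains only the bridge for their own local types, so nobody has to adopt a single imposed type system.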

Page 13

U-Compare Type System

Syntactic Level

Document Level

Semantic Level

Page 14

U-Compare: Evaluate and Compare TM Workflows

Library: Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER

Workflow A, Workflow B, Workflow C: different combinations of sentence splitter (A or B), POS tagger (A or B) and NER, compared by F-Score A, F-Score B, F-Score C

Sentence detection: UIMA SD, OpenNLP SD, GENIA SD

Tokenization: UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer

POS tagging: GENIA Tagger, Stepp Tagger, OpenNLP Tagger

NER: ABNER, MedT-NER, GENIA Tagger as NER
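As a rough illustration of how many workflows the components named on this slide could yield, the sketch below enumerates one choice per stage. The grouping into four stages, and the assumption that every combination is valid, are simplifications made for this example rather than claims about U-Compare itself.

```python
from itertools import product
from typing import Dict, List

# Component names are taken from the slide; the stage grouping is an assumption.
STAGES: Dict[str, List[str]] = {
    "sentence detection": ["UIMA SD", "OpenNLP SD", "GENIA SD"],
    "tokenization": ["UIMA Tokenizer", "OpenNLP Tokenizer", "GENIA Tagger as Tokenizer"],
    "POS tagging": ["GENIA Tagger", "Stepp Tagger", "OpenNLP Tagger"],
    "NER": ["ABNER", "MedT-NER", "GENIA Tagger as NER"],
}


def candidate_workflows(stages: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return every workflow that picks exactly one component per stage."""
    names = list(stages)
    return [dict(zip(names, combo)) for combo in product(*stages.values())]


workflows = candidate_workflows(STAGES)
print(len(workflows))  # 81 candidate workflows (3 x 3 x 3 x 3)
print(workflows[0])
```

Each candidate would then be run against the same gold standard and ranked by F-score, which is only practical when the components share a common data representation.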

Page 15

• Web-based application

• Interactive creation of workflows

• Cloud and high-performance computing

• Integrated TM/NLP processing system

• GUI for workflow creation

• Library of ready-to-use processing components

• Statistics, visualizations, developer APIs

• Supports UIMA

• http://argo.nactem.ac.uk

Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.

Page 16

Structured Data

Remote Processing

Workflow Diagramming

Workflow Designer

Manual Editing

Annotator/Curator

Processing Components

Developers

UIMA Compliance

Page 17

Processing Components

• Approaching 100 components (U-Compare)

– Additional 50 will be added soon

• META-NET

• Developed or co-developed by NaCTeM

– Planned: Make the library open to others to contribute

• Generic Listener component

– Developers can plug in their own locally run UIMA component to a workflow in Argo

Page 18

Remote Processing

• Single machine execution

– In-house high-performance machines

• Distributed processing

– HTCondor

– VMware vCloud (EBI) EUPMC

– Planned: EC2, Azure, …

Page 19

Workflows

• Users create workflows as block diagrams

• Workflows can be shared among users

– Read only

– Planned: Read & write

– Planned: downloadable workflows

• Workflows can be deployed as web services

– Plain text (input only), XMI, RDF, BioC

Page 20

Workflows view

Page 21

Workflow Editor

Page 22

Sample Use Cases

1 Recognition of chemical entities (chemical NER)

2 Semi-automatic curation of metabolic pathways

3 Evaluation of inter-annotator agreement

4 Information extraction as a Web service

Page 23

Use Case 1: Chemical NER

Supplies gold standard corpus

Removes gold annotations so that they can be created automatically

Combinations of syntactic and semantic components create annotations

Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
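The comparison step can be sketched as follows, assuming annotations are (begin, end, type) tuples and that only exact matches against the gold standard count as correct. The matching criterion, the `evaluate` function and the example spans are assumptions for illustration; the slide does not say how Argo's evaluation component matches annotations.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (begin offset, end offset, type)


def evaluate(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-type matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example data: one gold standard, two workflow branches.
gold = {(0, 7, "CHEMICAL"), (20, 27, "CHEMICAL"), (40, 48, "CHEMICAL")}
branch_a = {(0, 7, "CHEMICAL"), (20, 27, "CHEMICAL"), (50, 55, "CHEMICAL")}
branch_b = {(0, 7, "CHEMICAL")}

for name, predicted in [("branch A", branch_a), ("branch B", branch_b)]:
    p, r, f1 = evaluate(gold, predicted)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Branch B is more precise but finds fewer entities, so its F1 is lower; reporting all three figures per branch makes that trade-off visible.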

Page 24

Chemical Entity Recogniser

• Chemical model evaluated at BioCreative IV CHEMDNER challenge

• The challenge

– Data: 10,000 manually annotated PubMed abstracts

– Task: automatically recognise names of chemical entities in text

Page 25

Chemical Entity Recogniser

• Our solution

– Ranked unique mentions: ranked 1st out of 18 groups

– All mentions: ranked 3rd out of 19 groups

Subtask                  Precision %   Recall %   F-score %

Ranked unique mentions   91            85         88

All mentions             93            81         87
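As a quick consistency check, the reported F-scores agree with the harmonic mean of the reported precision and recall. The inputs are already rounded to whole percentages, so this is only an approximate check and not a re-computation of the official challenge results.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


ranked_unique = f_score(91, 85)  # about 87.9
all_mentions = f_score(93, 81)   # about 86.6
print(round(ranked_unique), round(all_mentions))  # 88 87
```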

Page 26

Use Case 2: Semi-automatic Curation – Metabolic Pathways

Search for relevant documents

Manual correction of automatic annotations

NER for chemicals, genes, process indicators

Linking to ontologies: CTD, ChEBI, UniProt

Save results in various formats, e.g., RDF for querying and incorporation into databases

Page 27

Manual Annotation Editor

Create new annotations by selecting text

Create, modify or delete annotations

Edit details of annotations

Open a graphical interface to link annotations to ontologies

Page 28

Filtering and converting annotations

Page 29

Manual Annotation Editor: linking to ontologies

Automatic pre-selection can be modified by the user

Details show ontology entry webpage

Page 30

Use Case 4: Information extraction as a Web service

Web service-enabled reader

Web service-enabled writer

Page 31

Language Universal

• Reusable modules

• Generic TM modules: Competence

• Annotated Text, corpora: Performance

• Standards of Data Representation and Types for Resources: Competence

• Dictionaries, Thesauri, Ontologies: Performance
