
Platform for dynamic deployment of information extraction web services

Author: Eduardo Duarte ([email protected])

Supervisors: David Campos ([email protected]), José Luís Oliveira ([email protected])

Context

Automation of:

• Information extraction (text mining)
• Concept recognition (machine learning, dictionaries)

in biomedical documents

Context: Overall workflow of IE solutions

[Workflow diagram: unprocessed corpus → Pre-processing → Concept Recognition (supported by Ontologies) → Post-processing → processed corpus]

Context: Neji, an existing framework for biomedical concept recognition

Main Objectives

• Develop a second version of the tool (backwards-incompatible changes)
• Make the tool easier for developers and researchers
• Create a dynamic, persistent space where multiple information extraction tools can be independently managed

The Problem: Web services platform

The user sends an input document to the server and retrieves an annotated document.

The Problem: Web services platform

If the user needs to use custom ontologies, they must first be sent to an Administrator.

The Problem: Web services platform

If multiple users need different ontologies (even if some of them overlap), we need multiple server instances with different configurations.

The Problem: Web services platform

Instead, we should have a single machine that:

- stores all ontologies (without duplicates);
- manages the ontologies used by each client.

The Problem: Web services platform

Additionally, ontology loading should occur on demand: once loaded, an ontology should be used automatically for all subsequent requests.

Requirements: Web services platform

• Create an on-demand server with all of Neji's features available on the web.
• This server should:
  • receive processing REST requests (GET and POST), as sketched below;
  • manage multiple simultaneous web services;
  • persist ontologies and jobs to avoid data loss.
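As an illustration of the processing endpoint, here is a minimal Jersey sketch; the resource path, parameter names and the AnnotationService registry are assumptions for the example, not Neji's actual API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.ws.rs.Consumes;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;

// Hypothetical per-client annotation service: each service id maps to its own
// configured pipeline (and set of ontologies). Names are illustrative only.
class AnnotationService {
    String process(String text) {
        // ... run the pipeline configured for this service ...
        return text;
    }
}

// Jersey resource exposing one processing endpoint per web service.
@Path("/services/{serviceId}/annotate")
public class AnnotateResource {

    private static final Map<String, AnnotationService> SERVICES =
            new ConcurrentHashMap<String, AnnotationService>();

    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public String annotateText(@PathParam("serviceId") String serviceId,
                               @QueryParam("text") String text) {
        // GET: the text to annotate travels as a query parameter.
        return SERVICES.get(serviceId).process(text);
    }

    @POST
    @Consumes(MediaType.TEXT_PLAIN)
    @Produces(MediaType.APPLICATION_JSON)
    public String annotateDocument(@PathParam("serviceId") String serviceId,
                                   String document) {
        // POST: the whole document travels in the request body.
        return SERVICES.get(serviceId).process(document);
    }
}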

Tasks & Technologies involved

To achieve this dynamism in the web platform, we needed to make multiple changes to the core tool codebase.

• Overall development: Java SE 7
• Versioning: Git
• Unit testing: JUnit
• Continuous Integration: Jenkins
• GDEP fork: C++
• Text pipelining: Monq.JFA
• Multi-language NLP support: OpenNLP
• REST and Web Services: Jersey
• Embedded server: Jetty

Task 1: Data sharing between modules

• Modules receive an input StringBuffer and perform transformations on it.
• But what if the modules need to share more data between them?
• Data sharing between module instances in a non-deterministic environment is problematic.
• However, the pipeline instance itself is already a shared resource.
• The module architecture was restructured so that modules have access to a blocking list on the pipeline, as sketched below.
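A minimal sketch of the idea, using illustrative class names rather than Neji's actual ones; a synchronized list stands in for the blocking list mentioned above:

import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// Illustrative only: the pipeline owns a single thread-safe list that every
// module instance attached to it can read from and append to during a run.
class Pipeline {
    private final List<Object> sharedData =
            Collections.synchronizedList(new LinkedList<Object>());

    List<Object> getSharedData() {
        return sharedData;
    }
}

abstract class Module {
    private Pipeline pipeline;  // set when the module is added to a pipeline

    void setPipeline(Pipeline pipeline) {
        this.pipeline = pipeline;
    }

    // Modules publish intermediate results for other modules in the same run...
    protected void share(Object value) {
        pipeline.getSharedData().add(value);
    }

    // ...and read back what earlier modules have already published.
    protected List<Object> shared() {
        return pipeline.getSharedData();
    }
}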

Task 2: Multiple output formats

Before:
• a pipeline holds a single writer and produces a single output file;
• to obtain the same processing in different formats, we need to run the pipeline multiple times.

After:
• a pipeline holds multiple writers and produces multiple output files;
• only one pipeline execution is required, as sketched below.
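A rough sketch of the change, with hypothetical Writer and Pipeline types (not Neji's actual classes): the pipeline keeps a list of writers and produces one output per writer in a single run.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: one pipeline run, several output formats.
interface Writer {
    String write(String processedCorpus);  // e.g. XML, JSON or CoNLL writers
}

class Pipeline {
    private final List<Writer> writers = new ArrayList<Writer>();

    Pipeline add(Writer writer) {
        writers.add(writer);
        return this;
    }

    // Run the (expensive) processing once, then hand the result to every writer.
    List<String> run(String corpus) {
        String processed = process(corpus);
        List<String> outputs = new ArrayList<String>();
        for (Writer writer : writers) {
            outputs.add(writer.write(processed));
        }
        return outputs;
    }

    private String process(String corpus) {
        // ... reading, NLP, concept recognition, post-processing ...
        return corpus;
    }
}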

Task 3: Pipeline validation

Before: an exception is thrown on the Dictionary module, and the pipeline thread needs to be interrupted manually.

pt.ua.tm.neji.exception.NejiException: Dictionary.java
from monq.jfa.ReSyntaxException: monq.jfa.DfaRun.java

[Pipeline diagram: Reader (requires nothing, provides passages) → DictionaryTagger (requires tokens, provides annotations) → Writers]

Task 3: Pipeline validation

After: the exception is thrown before the pipeline thread is even launched.

pt.ua.tm.neji.exception.NejiException: Dictionary required Tokens, which were not provided by earlier modules in the pipeline.

[Pipeline diagram: Reader (requires nothing, provides passages) → DictionaryTagger (requires tokens, provides annotations) → Writers]

Task 3: Pipeline validation

After: with a complete pipeline, no exceptions are thrown.

[Pipeline diagram: Reader (requires nothing, provides passages) → NLP (requires passages, provides tokens) → DictionaryTagger (requires tokens, provides annotations) → Writers]

Task 3: Pipeline validation

The requirement and provision of resources is enforced with Java annotations on each module:

@Requires(Tokens)
@Provides(Annotations)
public class DictionaryTagger extends BaseTagger {
    // ...
}
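A simplified sketch of how this check can run before the pipeline thread is launched; the Requires and Provides annotation types and the Resource enum below are illustrative stand-ins, not Neji's actual definitions:

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-ins for the module annotations shown above.
@Retention(RetentionPolicy.RUNTIME)
@interface Requires { Resource[] value(); }

@Retention(RetentionPolicy.RUNTIME)
@interface Provides { Resource[] value(); }

enum Resource { PASSAGES, TOKENS, ANNOTATIONS }

final class PipelineValidator {

    // Walk the modules in pipeline order and check that every resource a
    // module requires was already provided by an earlier module.
    static void validate(List<Class<?>> moduleClasses) {
        Set<Resource> available = EnumSet.noneOf(Resource.class);
        for (Class<?> module : moduleClasses) {
            Requires requires = module.getAnnotation(Requires.class);
            if (requires != null) {
                for (Resource r : requires.value()) {
                    if (!available.contains(r)) {
                        throw new IllegalStateException(module.getSimpleName()
                                + " required " + r + ", which was not provided"
                                + " by earlier modules in the pipeline.");
                    }
                }
            }
            Provides provides = module.getAnnotation(Provides.class);
            if (provides != null) {
                available.addAll(Arrays.asList(provides.value()));
            }
        }
    }
}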

Task 4: Reduce boilerplate code

Before: interacts directly with the Monq.JFA API.

public class RawReader extends BaseReader {
    public RawReader() {
        try {
            Nfa nfa = new Nfa(".+", action);
            setNFA(nfa, DfaRun.UNMATCHED_COPY);
        } catch (ReSyntaxException ex) {
            throw new NejiException(ex);
        }
    }

    private AbstractFaAction action = new AbstractFaAction() {
        public void invoke(StringBuffer yytext, int start, DfaRun runner) {
            StringBuilder sb = new StringBuilder();
            sb.append("<roi>");
            String s = XMLParsing.solveXMLEscapingProblems(yytext.toString());
            String unescapedText = StringEscapeUtils.unescapeXml(s);
            unescapedText = unescapedText.replaceAll("\n", "</roi>\n<roi>");
            sb.append(unescapedText);
            sb.append("</roi>");
            yytext.replace(start, yytext.length(), sb.toString());
        }
    };
}

Task 4: Reduce boilerplate code

After: interacts with our API, simplifying the implementation.

public class RawReader extends BaseReader {
    public RawReader() {
        super(UNMATCHED_COPY);
        super.addActionToRegex(".+");
    }

    @Override
    public String execute() {
        StringBuilder sb = new StringBuilder();
        sb.append("<roi>");
        String s = solveEscaping(output.toString());
        String unescapedText = unescapeXml(s);
        sb.append(unescapedText);
        sb.append("</roi>");
        return sb.toString();
    }
}

Task 5: Dynamic NLP

• Text parsing is covered by the NLP module.
• The level of NLP varies per request!
• If dependencies are required, the NLP module does dependency parsing for all requests (even requests that only require a lower level of parsing).

[Diagram: NLP parsing levels, from shallowest to deepest: Tokenization → Chunking → Dependencies]

Task 5: Dynamic NLP

Before: GDEP is initialized for dependency parsing, and dependency parsing is executed for all requests.

[Diagram: the NLP module runs Tokenization, Chunking and Dependencies for every request]

Task 5: Dynamic NLP

After: a GDEP fork, still initialized for dependency parsing, but using different parsing levels per request, as sketched below.

[Diagram: the NLP module runs only the levels each request needs, up to Tokenization, Chunking or Dependencies]
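A minimal sketch of per-request parsing levels; the ParsingLevel enum and DynamicNlpModule class are illustrative assumptions, not the actual GDEP fork's API:

// Illustrative only: the parser backend is started once for the deepest
// level, but each request runs only as far as it needs.
enum ParsingLevel {
    TOKENIZATION, CHUNKING, DEPENDENCIES;   // increasing depth

    boolean atLeast(ParsingLevel other) {
        return this.ordinal() >= other.ordinal();
    }
}

class DynamicNlpModule {

    String parse(String sentence, ParsingLevel requested) {
        String result = tokenize(sentence);
        if (requested.atLeast(ParsingLevel.CHUNKING)) {
            result = chunk(result);
        }
        if (requested.atLeast(ParsingLevel.DEPENDENCIES)) {
            result = dependencyParse(result);
        }
        return result;
    }

    // Placeholders for the real tokenization, chunking and dependency
    // parsing steps delegated to the parser backend.
    private String tokenize(String s) { return s; }
    private String chunk(String s) { return s; }
    private String dependencyParse(String s) { return s; }
}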

Thank you for your attention


