Platform for dynamic deployment of information extraction web services

Author: Eduardo Duarte
Supervisors: David Campos, José Luís Oliveira
Context

Automation of:
• Information extraction (text mining)
• Concept recognition (machine learning, dictionaries)
in biomedical documents
[Diagram: overall workflow of IE solutions — an unprocessed corpus goes through Pre-processing, Concept Recognition (supported by ontologies) and Post-processing, producing a processed corpus]
Main Objectives

• Develop a second version of the tool (backwards-incompatible changes)
• Make the tool easier for developers and researchers
• Create a dynamic, persistent space where multiple information extraction tools can be independently managed
The Problem: Web services platform

The User sends an input document to the server and retrieves an annotated document.
If the User needs to use custom ontologies, they must be sent to an Administrator first.
If there are multiple users that need different ontologies (even if some of them are the same), we need multiple server instances with different configurations.
Instead, we should have a single machine that:
- stores all ontologies (non-repeated);
- manages the ontologies used per client (as sketched below).
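A minimal sketch of such a deduplicated store, assuming hypothetical class and method names rather than Neji's actual API: ontologies are keyed by a content hash, so identical uploads are stored only once, while a per-client map records which ontologies apply to each client.

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical deduplicated ontology store: one copy per distinct
// ontology, plus a per-client view of which ontologies to apply.
public class OntologyStore {

    private final Map<String, String> ontologiesByHash =
            new ConcurrentHashMap<String, String>();
    private final Map<String, Set<String>> hashesByClient =
            new ConcurrentHashMap<String, Set<String>>();

    // Stores the ontology only if its content was not seen before,
    // and associates it with the client either way.
    public synchronized String add(String clientId, String content) throws Exception {
        String hash = sha1(content);
        if (!ontologiesByHash.containsKey(hash)) {
            ontologiesByHash.put(hash, content);
        }
        Set<String> clientHashes = hashesByClient.get(clientId);
        if (clientHashes == null) {
            clientHashes = new HashSet<String>();
            hashesByClient.put(clientId, clientHashes);
        }
        clientHashes.add(hash);
        return hash;
    }

    // Returns the ontologies configured for the given client.
    public synchronized List<String> ontologiesFor(String clientId) {
        List<String> result = new ArrayList<String>();
        Set<String> clientHashes = hashesByClient.get(clientId);
        if (clientHashes != null) {
            for (String hash : clientHashes) {
                result.add(ontologiesByHash.get(hash));
            }
        }
        return result;
    }

    private static String sha1(String content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(content.getBytes(Charset.forName("UTF-8")));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}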
Additionally, ontology loading should occur on-demand; in other words, uploaded ontologies should be used automatically for all subsequent requests (see the sketch below).
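One way to realize this, sketched with hypothetical names and building on the OntologyStore above: a client's ontologies are loaded on the first request that needs them, then cached and reused for every subsequent request.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical on-demand loader: loads a client's ontologies on
// first use and reuses the result for all subsequent requests.
public class OnDemandOntologyLoader {

    private final Map<String, List<String>> loadedByClient =
            new ConcurrentHashMap<String, List<String>>();
    private final OntologyStore store;

    public OnDemandOntologyLoader(OntologyStore store) {
        this.store = store;
    }

    public List<String> ontologiesFor(String clientId) {
        List<String> loaded = loadedByClient.get(clientId);
        if (loaded == null) {
            // First request for this client: load (expensive) and cache.
            loaded = store.ontologiesFor(clientId);
            loadedByClient.put(clientId, loaded);
        }
        return loaded;
    }

    // Invalidate the cache when the client uploads a new ontology,
    // so the next request triggers a reload.
    public void invalidate(String clientId) {
        loadedByClient.remove(clientId);
    }
}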
Requirements: Web services platform

• Create an on-demand server with all of Neji's features available on the web;
• this server should:
  • receive processing REST requests (GET and POST);
  • manage multiple simultaneous web services;
  • persist ontologies and jobs to avoid data loss.
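A minimal sketch of what such a processing endpoint could look like with Jersey (JAX-RS); the resource path, parameter names, and the Annotator helper are hypothetical illustrations, not Neji's actual API:

import javax.ws.rs.*;
import javax.ws.rs.core.MediaType;

// Hypothetical JAX-RS resource for the processing endpoint.
@Path("/annotate")
public class AnnotationResource {

    // GET: annotate a short text passed as a query parameter.
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public String annotateText(@QueryParam("text") String text) {
        return Annotator.annotate(text); // hypothetical helper
    }

    // POST: annotate a full document sent in the request body.
    @POST
    @Consumes(MediaType.TEXT_PLAIN)
    @Produces(MediaType.APPLICATION_JSON)
    public String annotateDocument(String document) {
        return Annotator.annotate(document); // hypothetical helper
    }
}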
Tasks & Technologies involved

To achieve this dynamism in the web platform, we needed to make multiple changes to the core tool's codebase.

• Overall development: Java SE 7
• Versioning: Git
• Unit testing: JUnit
• Continuous Integration: Jenkins
• GDEP fork: C++
• Text pipelining: Monq.JFA
• Multi-language NLP support: OpenNLP
• REST and Web Services: Jersey
• Embedded server: Jetty
Task 1: Data sharing between modules

• Modules receive an input StringBuffer and perform transformations on it
  • but what if the modules need to share more data between them?
• Data sharing between module instances in a non-deterministic environment is problematic
  • however, the pipeline instance itself is already a shared resource
• Restructured the module architecture so that modules have access to a blocking list on the pipeline (see the sketch below)
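A minimal sketch of the idea, with hypothetical names: the pipeline owns a blocking, thread-safe collection, and each module keeps a reference to its pipeline, so modules can exchange extra data without coupling to each other directly.

import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical pipeline exposing a blocking collection that every
// module in the run can read from and write to.
public class Pipeline {

    private final BlockingDeque<Object> sharedData =
            new LinkedBlockingDeque<Object>();

    public void storeData(Object data) {
        sharedData.add(data);
    }

    public BlockingDeque<Object> getSharedData() {
        return sharedData;
    }
}

// Modules receive the pipeline they belong to, making the pipeline
// the single shared resource between module instances.
abstract class BaseModule {

    private Pipeline pipeline;

    void setPipeline(Pipeline pipeline) {
        this.pipeline = pipeline;
    }

    protected Pipeline getPipeline() {
        return pipeline;
    }
}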
Task 2: Multiple output formats

Before:
• a pipeline holds a single writer and produces a single output file
• to obtain the same processing in different formats, we need to run the pipeline multiple times

After:
• a pipeline holds multiple writers and produces multiple output files
• only one pipeline execution required (see the sketch below)
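A minimal sketch of the "after" design, with hypothetical interfaces: the pipeline keeps a list of writers and feeds the processed text to each one, so a single execution yields one output per requested format.

import java.util.ArrayList;
import java.util.List;

// Hypothetical writer interface: converts processed text to one format.
interface Writer {
    String write(String processedText);
}

// The pipeline now holds several writers instead of one, so a single
// run produces one output per registered format.
public class MultiOutputPipeline {

    private final List<Writer> writers = new ArrayList<Writer>();

    public void addWriter(Writer writer) {
        writers.add(writer);
    }

    public List<String> run(String inputText) {
        String processed = process(inputText); // executed only once
        List<String> outputs = new ArrayList<String>();
        for (Writer writer : writers) {
            outputs.add(writer.write(processed));
        }
        return outputs;
    }

    private String process(String inputText) {
        return inputText; // placeholder for the actual processing modules
    }
}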
Task 3: Pipeline validation

Before: an exception is thrown inside Dictionary at processing time, and the pipeline thread needs to be interrupted manually.

pt.ua.tm.neji.exception.NejiException: Dictionary.java
from monq.jfa.ReSyntaxException: monq.jfa.DfaRun.java

[Diagram: Reader (requires nothing, provides passages) → DictionaryTagger (requires tokens, provides annotations) → Writers]
After: the exception is thrown before the Pipeline thread is even launched.

pt.ua.tm.neji.exception.NejiException: Dictionary required Tokens, which were not provided by earlier modules in the pipeline.

[Diagram: Reader (requires nothing, provides passages) → DictionaryTagger (requires tokens, provides annotations) → Writers]
With a valid module order, no exceptions are thrown:

[Diagram: Reader (requires nothing, provides passages) → NLP (requires passages, provides tokens) → DictionaryTagger (requires tokens, provides annotations) → Writers]
Requirement and provision of resources is enforced with Java annotations on each module:

@Requires(Tokens)
@Provides(Annotations)
public class DictionaryTagger extends BaseTagger {
    …
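A minimal sketch of how such a pre-launch check could work, with hypothetical annotation and enum definitions that mirror the slide (not necessarily Neji's actual ones, and using a standard exception instead of NejiException): walk the modules in pipeline order and fail before the pipeline thread starts if a requirement is not met.

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical resource kinds exchanged between modules.
enum Resource { PASSAGES, TOKENS, ANNOTATIONS }

// Hypothetical annotations mirroring @Requires/@Provides on the slide.
@Retention(RetentionPolicy.RUNTIME)
@interface Requires { Resource[] value(); }

@Retention(RetentionPolicy.RUNTIME)
@interface Provides { Resource[] value(); }

// Walks the modules in pipeline order, before the pipeline thread is
// launched, and fails fast if a requirement is not met.
public class PipelineValidator {

    public void validate(List<?> modules) {
        Set<Resource> available = EnumSet.noneOf(Resource.class);
        for (Object module : modules) {
            Requires requires = module.getClass().getAnnotation(Requires.class);
            if (requires != null) {
                for (Resource r : requires.value()) {
                    if (!available.contains(r)) {
                        throw new IllegalStateException(
                                module.getClass().getSimpleName() + " required " + r
                                + ", which was not provided by earlier modules.");
                    }
                }
            }
            Provides provides = module.getClass().getAnnotation(Provides.class);
            if (provides != null) {
                available.addAll(Arrays.asList(provides.value()));
            }
        }
    }
}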
Task 4: Reduce boilerplate code

Before: interacts directly with the Monq.JFA API

public class RawReader extends BaseReader {

    public RawReader() throws NejiException {
        try {
            // Monq.JFA boilerplate: build the automaton by hand.
            Nfa nfa = new Nfa(".+", action);
            setNFA(nfa, DfaRun.UNMATCHED_COPY);
        } catch (ReSyntaxException ex) {
            throw new NejiException(ex);
        }
    }

    private AbstractFaAction action = new AbstractFaAction() {
        public void invoke(StringBuffer yytext, int start, DfaRun runner) {
            // Wrap each line of the unescaped text in <roi> tags.
            StringBuilder sb = new StringBuilder();
            sb.append("<roi>");
            String s = XMLParsing.solveXMLEscapingProblems(yytext.toString());
            String unescapedText = StringEscapeUtils.unescapeXml(s);
            unescapedText = unescapedText.replaceAll("\n", "</roi>\n<roi>");
            sb.append(unescapedText);
            sb.append("</roi>");
            yytext.replace(start, yytext.length(), sb.toString());
        }
    };
}
After: interacts with our API, simplifying the implementation

public class RawReader extends BaseReader {

    public RawReader() {
        super(UNMATCHED_COPY);
        super.addActionToRegex(".+");
    }

    @Override
    public String execute() {
        // Same <roi> wrapping, but the automaton setup and escaping
        // helpers are now handled by our API.
        StringBuilder sb = new StringBuilder();
        sb.append("<roi>");
        String s = solveEscaping(output.toString());
        String unescapedText = unescapeXml(s);
        sb.append(unescapedText);
        sb.append("</roi>");
        return sb.toString();
    }
}
Task 5: Dynamic NLP

• Text parsing is covered by the NLP module;
• the level of NLP varies per request!
• if dependencies are required, the NLP module does dependency parsing for all requests (even requests that only require a lower level of parsing)

[Diagram: parsing levels, from shallowest to deepest — Tokenization → Chunking → Dependencies]

Before: GDEP is initialized for dependency parsing, and dependency parsing is executed for all requests.

After: GDEP fork, initialized for dependency parsing, but using different parsing levels per request (see the sketch below).
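A minimal sketch of the per-request behavior, with hypothetical names: the parser is initialized once for the deepest level, but each request only runs the stages up to the level it actually needs.

// Hypothetical parsing levels, ordered from shallowest to deepest.
enum ParsingLevel { TOKENIZATION, CHUNKING, DEPENDENCIES }

// The parser is initialized once for the deepest level, but each
// request runs only as deep as its requested level.
public class DynamicNLPModule {

    public DynamicNLPModule() {
        initializeParser(ParsingLevel.DEPENDENCIES); // one-time, expensive
    }

    public void process(String text, ParsingLevel requested) {
        tokenize(text);
        if (requested.ordinal() >= ParsingLevel.CHUNKING.ordinal()) {
            chunk(text);
        }
        if (requested.ordinal() >= ParsingLevel.DEPENDENCIES.ordinal()) {
            parseDependencies(text);
        }
    }

    private void initializeParser(ParsingLevel maxLevel) { /* load models */ }
    private void tokenize(String text) { /* ... */ }
    private void chunk(String text) { /* ... */ }
    private void parseDependencies(String text) { /* ... */ }
}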