Platform for dynamic deployment of information extraction web services

Author: Eduardo Duarte
Supervisors: David Campos, José Luís Oliveira
Context

Automation of:
• Information extraction (text mining)
• Concept recognition (machine learning, dictionaries)
in biomedical documents
[Diagram: overall workflow of IE solutions — an unprocessed corpus goes through Pre-processing, Concept Recognition (supported by ontologies) and Post-processing, producing a processed corpus]
Main Objectives

• Develop a second version of the tool (backwards-incompatible changes)
• Make the tool easier for developers and researchers
• Create a dynamic, persistent space where multiple information extraction tools can be independently managed
The Problem: Web services platform

The User sends an input document to the server and retrieves an annotated document.
If the User needs to use custom ontologies, they must be sent to an Administrator first.
If there are multiple users that need different ontologies (even if some of them are the same), we need multiple server instances with different configurations.
Instead, we should have a single machine that:
- stores all ontologies (non-repeated);
- manages the ontologies used per client (as sketched below).
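A minimal sketch of such a deduplicated store, assuming hypothetical class and method names rather than Neji's actual API: ontologies are keyed by a content hash, so identical uploads are stored only once, while a per-client map records which ontologies apply to each client.

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical deduplicated ontology store: one copy per distinct
// ontology, plus a per-client view of which ontologies to apply.
public class OntologyStore {

    private final Map<String, String> ontologiesByHash =
            new ConcurrentHashMap<String, String>();
    private final Map<String, Set<String>> hashesByClient =
            new ConcurrentHashMap<String, Set<String>>();

    // Stores the ontology only if its content was not seen before,
    // and associates it with the client either way.
    public synchronized String add(String clientId, String content) throws Exception {
        String hash = sha1(content);
        if (!ontologiesByHash.containsKey(hash)) {
            ontologiesByHash.put(hash, content);
        }
        Set<String> clientHashes = hashesByClient.get(clientId);
        if (clientHashes == null) {
            clientHashes = new HashSet<String>();
            hashesByClient.put(clientId, clientHashes);
        }
        clientHashes.add(hash);
        return hash;
    }

    // Returns the ontologies configured for the given client.
    public synchronized List<String> ontologiesFor(String clientId) {
        List<String> result = new ArrayList<String>();
        Set<String> clientHashes = hashesByClient.get(clientId);
        if (clientHashes != null) {
            for (String hash : clientHashes) {
                result.add(ontologiesByHash.get(hash));
            }
        }
        return result;
    }

    private static String sha1(String content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(content.getBytes(Charset.forName("UTF-8")));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}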
Additionally, ontology loading should occur on-demand; in other words, uploaded ontologies should be used automatically for all subsequent requests (see the sketch below).
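One way to realize this, sketched with hypothetical names and building on the OntologyStore above: a client's ontologies are loaded on the first request that needs them, then cached and reused for every subsequent request.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical on-demand loader: loads a client's ontologies on
// first use and reuses the result for all subsequent requests.
public class OnDemandOntologyLoader {

    private final Map<String, List<String>> loadedByClient =
            new ConcurrentHashMap<String, List<String>>();
    private final OntologyStore store;

    public OnDemandOntologyLoader(OntologyStore store) {
        this.store = store;
    }

    public List<String> ontologiesFor(String clientId) {
        List<String> loaded = loadedByClient.get(clientId);
        if (loaded == null) {
            // First request for this client: load (expensive) and cache.
            loaded = store.ontologiesFor(clientId);
            loadedByClient.put(clientId, loaded);
        }
        return loaded;
    }

    // Invalidate the cache when the client uploads a new ontology,
    // so the next request triggers a reload.
    public void invalidate(String clientId) {
        loadedByClient.remove(clientId);
    }
}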
Requirements: Web services platform

• Create an on-demand server with all of Neji's features available on the web;
• this server should:
  • receive processing REST requests (GET and POST);
  • manage multiple simultaneous web services;
  • persist ontologies and jobs to avoid data loss.
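A minimal sketch of what such a processing endpoint could look like with Jersey (JAX-RS); the resource path, parameter names, and the Annotator helper are hypothetical illustrations, not Neji's actual API:

import javax.ws.rs.*;
import javax.ws.rs.core.MediaType;

// Hypothetical JAX-RS resource for the processing endpoint.
@Path("/annotate")
public class AnnotationResource {

    // GET: annotate a short text passed as a query parameter.
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public String annotateText(@QueryParam("text") String text) {
        return Annotator.annotate(text); // hypothetical helper
    }

    // POST: annotate a full document sent in the request body.
    @POST
    @Consumes(MediaType.TEXT_PLAIN)
    @Produces(MediaType.APPLICATION_JSON)
    public String annotateDocument(String document) {
        return Annotator.annotate(document); // hypothetical helper
    }
}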
Tasks & Technologies involved

To achieve this dynamism in the web platform, we needed to make multiple changes to the core tool's codebase.

• Overall development: Java SE 7
• Versioning: Git
• Unit testing: JUnit
• Continuous Integration: Jenkins
• GDEP fork: C++
• Text pipelining: Monq.JFA
• Multi-language NLP support: OpenNLP
• REST and Web Services: Jersey
• Embedded server: Jetty
Task 1: Data sharing between modules

• Modules receive an input StringBuffer and perform transformations on it
  • but what if the modules need to share more data between them?
• Data sharing between module instances in a non-deterministic environment is problematic
  • however, the pipeline instance itself is already a shared resource
• Restructured the module architecture so that modules have access to a blocking list on the pipeline (see the sketch below)
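A minimal sketch of the idea, with hypothetical names: the pipeline owns a blocking, thread-safe collection, and each module keeps a reference to its pipeline, so modules can exchange extra data without coupling to each other directly.

import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

// Hypothetical pipeline exposing a blocking collection that every
// module in the run can read from and write to.
public class Pipeline {

    private final BlockingDeque<Object> sharedData =
            new LinkedBlockingDeque<Object>();

    public void storeData(Object data) {
        sharedData.add(data);
    }

    public BlockingDeque<Object> getSharedData() {
        return sharedData;
    }
}

// Modules receive the pipeline they belong to, making the pipeline
// the single shared resource between module instances.
abstract class BaseModule {

    private Pipeline pipeline;

    void setPipeline(Pipeline pipeline) {
        this.pipeline = pipeline;
    }

    protected Pipeline getPipeline() {
        return pipeline;
    }
}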
Task 2: Multiple output formats

Before:
• a pipeline holds a single writer and produces a single output file
• to obtain the same processing in different formats, we need to run the pipeline multiple times

After:
• a pipeline holds multiple writers and produces multiple output files
• only one pipeline execution required (see the sketch below)
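A minimal sketch of the "after" design, with hypothetical interfaces: the pipeline keeps a list of writers and feeds the processed text to each one, so a single execution yields one output per requested format.

import java.util.ArrayList;
import java.util.List;

// Hypothetical writer interface: converts processed text to one format.
interface Writer {
    String write(String processedText);
}

// The pipeline now holds several writers instead of one, so a single
// run produces one output per registered format.
public class MultiOutputPipeline {

    private final List<Writer> writers = new ArrayList<Writer>();

    public void addWriter(Writer writer) {
        writers.add(writer);
    }

    public List<String> run(String inputText) {
        String processed = process(inputText); // executed only once
        List<String> outputs = new ArrayList<String>();
        for (Writer writer : writers) {
            outputs.add(writer.write(processed));
        }
        return outputs;
    }

    private String process(String inputText) {
        return inputText; // placeholder for the actual processing modules
    }
}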
Task 3: Pipeline validation

Before: an exception is thrown inside Dictionary at processing time, and the pipeline thread needs to be interrupted manually.

pt.ua.tm.neji.exception.NejiException: Dictionary.java
from monq.jfa.ReSyntaxException: monq.jfa.DfaRun.java

[Diagram: Reader (requires nothing, provides passages) → DictionaryTagger (requires tokens, provides annotations) → Writers]
After: the exception is thrown before the Pipeline thread is even launched.

pt.ua.tm.neji.exception.NejiException: Dictionary required Tokens, which were not provided by earlier modules in the pipeline.

[Diagram: Reader (requires nothing, provides passages) → DictionaryTagger (requires tokens, provides annotations) → Writers]
With a valid module order, no exceptions are thrown:

[Diagram: Reader (requires nothing, provides passages) → NLP (requires passages, provides tokens) → DictionaryTagger (requires tokens, provides annotations) → Writers]
Requirement and provision of resources is enforced with Java annotations on each module:

@Requires(Tokens)
@Provides(Annotations)
public class DictionaryTagger extends BaseTagger {
    …
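A minimal sketch of how such a pre-launch check could work, with hypothetical annotation and enum definitions that mirror the slide (not necessarily Neji's actual ones, and using a standard exception instead of NejiException): walk the modules in pipeline order and fail before the pipeline thread starts if a requirement is not met.

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical resource kinds exchanged between modules.
enum Resource { PASSAGES, TOKENS, ANNOTATIONS }

// Hypothetical annotations mirroring @Requires/@Provides on the slide.
@Retention(RetentionPolicy.RUNTIME)
@interface Requires { Resource[] value(); }

@Retention(RetentionPolicy.RUNTIME)
@interface Provides { Resource[] value(); }

// Walks the modules in pipeline order, before the pipeline thread is
// launched, and fails fast if a requirement is not met.
public class PipelineValidator {

    public void validate(List<?> modules) {
        Set<Resource> available = EnumSet.noneOf(Resource.class);
        for (Object module : modules) {
            Requires requires = module.getClass().getAnnotation(Requires.class);
            if (requires != null) {
                for (Resource r : requires.value()) {
                    if (!available.contains(r)) {
                        throw new IllegalStateException(
                                module.getClass().getSimpleName() + " required " + r
                                + ", which was not provided by earlier modules.");
                    }
                }
            }
            Provides provides = module.getClass().getAnnotation(Provides.class);
            if (provides != null) {
                available.addAll(Arrays.asList(provides.value()));
            }
        }
    }
}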
Task 4: Reduce boilerplate code

Before: interacts directly with the Monq.JFA API

public class RawReader extends BaseReader {

    public RawReader() throws NejiException {
        try {
            // Monq.JFA boilerplate: build the automaton by hand.
            Nfa nfa = new Nfa(".+", action);
            setNFA(nfa, DfaRun.UNMATCHED_COPY);
        } catch (ReSyntaxException ex) {
            throw new NejiException(ex);
        }
    }

    private AbstractFaAction action = new AbstractFaAction() {
        public void invoke(StringBuffer yytext, int start, DfaRun runner) {
            // Wrap each line of the unescaped text in <roi> tags.
            StringBuilder sb = new StringBuilder();
            sb.append("<roi>");
            String s = XMLParsing.solveXMLEscapingProblems(yytext.toString());
            String unescapedText = StringEscapeUtils.unescapeXml(s);
            unescapedText = unescapedText.replaceAll("\n", "</roi>\n<roi>");
            sb.append(unescapedText);
            sb.append("</roi>");
            yytext.replace(start, yytext.length(), sb.toString());
        }
    };
}
After: interacts with our API, simplifying the implementation

public class RawReader extends BaseReader {

    public RawReader() {
        super(UNMATCHED_COPY);
        super.addActionToRegex(".+");
    }

    @Override
    public String execute() {
        // Same <roi> wrapping, but the automaton setup and escaping
        // helpers are now handled by our API.
        StringBuilder sb = new StringBuilder();
        sb.append("<roi>");
        String s = solveEscaping(output.toString());
        String unescapedText = unescapeXml(s);
        sb.append(unescapedText);
        sb.append("</roi>");
        return sb.toString();
    }
}
Task 5: Dynamic NLP

• Text parsing is covered by the NLP module;
• the level of NLP varies per request!
• if dependencies are required, the NLP module does dependency parsing for all requests (even requests that only require a lower level of parsing)

[Diagram: parsing levels, from shallowest to deepest — Tokenization → Chunking → Dependencies]

Before: GDEP is initialized for dependency parsing, and dependency parsing is executed for all requests.

After: GDEP fork, initialized for dependency parsing, but using different parsing levels per request (see the sketch below).
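A minimal sketch of the per-request behavior, with hypothetical names: the parser is initialized once for the deepest level, but each request only runs the stages up to the level it actually needs.

// Hypothetical parsing levels, ordered from shallowest to deepest.
enum ParsingLevel { TOKENIZATION, CHUNKING, DEPENDENCIES }

// The parser is initialized once for the deepest level, but each
// request runs only as deep as its requested level.
public class DynamicNLPModule {

    public DynamicNLPModule() {
        initializeParser(ParsingLevel.DEPENDENCIES); // one-time, expensive
    }

    public void process(String text, ParsingLevel requested) {
        tokenize(text);
        if (requested.ordinal() >= ParsingLevel.CHUNKING.ordinal()) {
            chunk(text);
        }
        if (requested.ordinal() >= ParsingLevel.DEPENDENCIES.ordinal()) {
            parseDependencies(text);
        }
    }

    private void initializeParser(ParsingLevel maxLevel) { /* load models */ }
    private void tokenize(String text) { /* ... */ }
    private void chunk(String text) { /* ... */ }
    private void parseDependencies(String text) { /* ... */ }
}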