PANACEA WP3 The Platform

WP participants:

UPF, ILC, ILSP, LG, DCU, ELDA

Final Annual Review

19th February 2013

Marc Poch, UPF ([email protected])

Summary

• Objectives
• Platform components / Demo
• Achievements
  – Functional platform
  – Interoperability: Travelling Object, Common Interfaces, format converters, etc.
  – Scalability
• WP7 Evaluation
• Conclusions and future work

Objectives

Development of a platform (a space of interoperability defined by standardized protocols and common interfaces) for the easy integration of a variety of software components, tools and methodologies deployed as web services to configure a factory for the automation of acquisition, processing and annotation of language resources.


• WP3.1 (T1-T6): Architecture and design of the platform
• WP3.2 (T15-T30): Workflow editor and engine
• WP3.3 (T7-T30): Common interfaces, middleware and temporary files, journaling, etc.
• WP3.4 (T15-T30): The Registry
• WP3.5 (T7-T30): Deployment of web services of the components supplied by WP4 to WP6

From local tools to sharing workflows

[Diagram labels: Tools to be integrated; Web Service wrapper; The Registry; Common Interfaces; Format Converters; Workflow editor and engine; Sharing workflows. Clients: Java, Python, Perl, etc.]

Platform tools and portals

• Web Services: share tools (remotely run distributed tools). SOAP or REST: Soaplab, JAX-WS, Axis, CXF, etc.
• Registry: share and find Web Services. BioCatalogue. PANACEA Registry: registry.elda.org
• Workflows: call / chain Web Services. Taverna: www.taverna.org.uk
• Social network: share and find workflows. myExperiment. PANACEA myExperiment: myexperiment.elda.org

PANACEA Platform: uses, adapts and improves myGrid tools for eScience (used in biology, social science, music, astronomy, multimedia and chemistry).

Technological option: Web Services

SOAPLAB 2 (SOAP)

• Easy deployment of command line tools as WS (Java, Python, C++, UIMA, etc.)
• Clients: Java, Python, Perl, Taverna, etc.
• No coding needed! Only metadata
• “Polling” techniques for long-lasting tasks
• Web form to run the web services
• URL input / output ready
• PANACEA improvement for SOAP messaging (network usage and memory)
• PANACEA limit for multiple users
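The “polling” idea for long-lasting tasks can be illustrated with a minimal Python sketch. It assumes a hypothetical `get_status` callable that asks the service for the state of an already submitted job; the function name, the status strings and the defaults are illustrative and are not the Soaplab API. The default values mirror the poll settings reported in the parallel-request tests of this deck (2000, 1, 10000), assumed here to be milliseconds.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], str],
    interval_ms: float = 2000,
    backoff: float = 1.0,
    max_interval_ms: float = 10000,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Ask for the job status until it is finished; return the number of polls.

    The client never keeps one long HTTP request open: each poll is a short
    call, so a slow job cannot trigger a client-side timeout.
    """
    wait_ms = float(interval_ms)
    for polls in range(1, max_polls + 1):
        status = get_status()  # hypothetical status strings, see lead-in
        if status == "COMPLETED":
            return polls
        if status == "FAILED":
            raise RuntimeError("job failed on the server")
        sleep(wait_ms / 1000.0)
        # Grow the waiting time by the backoff factor, capped at the maximum.
        wait_ms = min(wait_ms * backoff, max_interval_ms)
    raise TimeoutError("job did not finish within max_polls")


if __name__ == "__main__":
    # Fake job that needs three status checks; sleeping is replaced by a recorder.
    states = iter(["RUNNING", "RUNNING", "COMPLETED"])
    waits = []
    print(poll_until_done(lambda: next(states), sleep=waits.append), waits)
```

Passing `sleep` as a parameter is only a design convenience: it lets the loop be tested without real waiting.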

Technological option: Registry

BioCatalogue

• User-friendly GUI
• Free, open source, continuously maintained
• Search function
• User ratings (user feedback)
• Service annotations and language categorization (PANACEA)
• Monitoring system (web service status and data results). Status values: Passed, Warning, Failed, Unchecked

Technological option: Taverna

• User-friendly GUI
• Free and open source
• Continuously maintained (v. 2.4)
• SOAP and REST web services
• Credentials manager (passwords, certificates, etc.)
• Multiple files processing (“lists”)
• PANACEA workflows, best practices, videos, etc.:
  – Parallelization, error recovery: “retries”, polling
• PANACEA collaboration: bug fixing and pre-release tests

Demos

• Previous Review:
  – PANACEA Registry / PANACEA myExperiment
  – Run Web Services and Workflows
  – Design and merging of workflows in Taverna
• Final Review: Specific examples
  – Creation of a bilingual dictionary
  – Twitter NLP
  – Web cleaner and anonymizer
  – PANACEA Registry / PANACEA myExperiment

Demos I

Creation of a bilingual dictionary
– http://myexperiment.elda.org/workflows/93
– Input: pairs of Basic XCES documents
  • English: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/1.xml
  • French: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/191.xml

1. Sentence alignment: Hunalign (3rd party tool) (Interoperability)
2. PoS tagging: Treetagger (3rd party tool) (Interoperability)
3. Build phrase tables: Moses (3rd party tool) (Interoperability)
4. Bilingual dictionary extractor

Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_bilingual_dictionary_extraction_v01.mp4

Demos II

Twitter NLP + Registry (3rd party tool)
• This web service is based on the Twitter NLP tool developed by Noah's ARK group.
• Noah's ARK group is Noah Smith's research group at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

1. Search the WS in the Registry
2. Check monitoring system
3. Use web client with example data

Demos III

Web cleaner and anonymizer
http://myexperiment.elda.org/workflows/98

• Input: a list of URLs to process
  – Example: a web article from www.fifa.com

1. ILSP Web cleaner and text extractor WS
2. UPF Anonymizer WS
   – Internally calls Freeling NER WS (3rd party tool) (Interoperability)

Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_web_cleaner_and_anonymization_v01.mp4

WP3 Achievements

• Functional and Operational Platform
  – Multiple tools, webs and features
  – Ready to use
  – Usability
  – Real users
• Interoperability
  – Common Interfaces
  – Travelling Object
  – 3rd party tools integration
  – Format converters
• Scalability
  – Web service scalability: long-lasting tasks
  – Workflow design optimization: robustness
  – Machine resources: handling parallel requests

Functional and Operational Platform

• PANACEA Registry
  – 157 web services (PANACEA WS benefits: WS are easy to deploy, low maintenance cost)
  – More than 1300 annotations (Usability / Doc.)
  – A cloud of 164 tags
  – Monitoring system: WS up and running 94.82% since their deployment (97%) (Availability)
• PANACEA myExperiment
  – 74 shared workflows
• Storage System (Usability)

Functional and Operational Platform: Tutorials and Documentation

• Tutorials
  – Specific and general tutorials
  – More than 12 videos (Usability)
  – Frequently Asked Questions
• Documentation
  – Registry annotations, tags and categories
  – Common Interfaces documentation: xml, web, etc.
  – Travelling Objects documentation

Functional and Operational Platform: Users

• WP7 Validators
• Linguatech (WP8)
• Qualia (Business intelligence)
• CNGL (Centre for Next Generation Localisation)
• INCYTA (Translation)
• Master's and PhD students make use of the PANACEA platform
• http://ws02.iula.upf.edu/panacea/statistics/upf-statistics.html

Interoperability

• Three levels of interoperability:
  – COMMUNICATION PROTOCOLS: SOAP, REST
  – DATA
  – PARAMETERS

[Diagram: Tool A (Format N), Tool B (Format M), Tool C (Format L). Tool B does not “understand” format N! Goal: all tools understand the previous format. Example: Tool A and Tool B both use ABCD (match); Tool A uses YTQZ while Tool B uses ABCD (mismatch).]

Common Interface

• A Common Interface (CI) defines the mandatory parameters for every functionality:

  PoS Tagger A
    MANDATORY: input, language
    OPTIONALS: Param A

  PoS Tagger B
    MANDATORY: input, language
    OPTIONALS: PARAM 1, PARAM 2

http://panacea-lr.eu/en/info-for-professionals/documents/
http://registry.elda.org
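A minimal Python sketch of what a Common Interface check could look like follows. It only models the idea from the slide: every PoS tagger shares the mandatory parameters `input` and `language`, and each tool may add its own optional ones. The class, the constant and the method names are hypothetical and do not come from the PANACEA Common Interfaces documentation.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class CommonInterface:
    """Mandatory parameters shared by every service of one functionality."""

    functionality: str
    mandatory: FrozenSet[str]

    def missing(self, params: Dict[str, object]) -> List[str]:
        """Return the mandatory parameter names absent from a call."""
        return sorted(self.mandatory - set(params))


# Hypothetical CI for PoS tagging, using the two mandatory names on the slide.
POS_TAGGER_CI = CommonInterface("pos_tagging", frozenset({"input", "language"}))

if __name__ == "__main__":
    # Tagger A and Tagger B differ only in their optional parameters,
    # so a client can swap one for the other without changing the call.
    call_a = {"input": "http://example.org/doc.txt", "language": "en", "param_a": 1}
    call_b = {"input": "http://example.org/doc.txt", "param_1": True}
    print(POS_TAGGER_CI.missing(call_a))
    print(POS_TAGGER_CI.missing(call_b))
```

Optional parameters are deliberately ignored by the check, since the slide only constrains the mandatory ones.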

Travelling Object

• The Travelling Object (TO) is the common data and metadata format used in PANACEA to make components understand each other. (Interoperability)
• TO1 is the minimal common vertical in-line format, based on the XCES standard, used by the deployed tools since the first version of the platform
• TO2 uses the GrAF standard: the Graph Annotation Format (Ide and Suderman, 2007) is the XML serialization of LAF (ISO 24612, 2009)
• LMF for lexical resources
• CoNLL for parsers
• Converters and adapted WS outputs

Format Converters

31 format converters on the PANACEA Registry:
• Freeling to TO. CNR http://registry.elda.org/services/207
• KAF to TO. CNR http://registry.elda.org/services/208
• Basic Xces to txt. CNR http://registry.elda.org/services/209
• PoS tag. (Freeling treetagger) to GrAF. UPF http://registry.elda.org/services/142
• Dependency parsing (Freeling) to GrAF. UPF http://registry.elda.org/services/197
• Dependency CoNLL to GrAF. CNR http://registry.elda.org/services/254
• Word doc to txt. UPF http://registry.elda.org/services/112
• In-house mwe to LMF. CNR http://registry.elda.org/services/296
• Pdf to text. UPF http://registry.elda.org/services/116
• Multi. encodings converter (ISO, UTF, etc.). UPF http://registry.elda.org/services/114
• Aligner to TO. DCU http://registry.elda.org/services/69
• Sentence alignment to TMX. DCU http://registry.elda.org/services/219
• Treetagger to MOSES. DCU http://registry.elda.org/services/275
• UIMA to GrAF. ILSP http://registry.elda.org/services/182
• METASHARE metadata generators http://myexperiment.elda.org/workflows/96
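The role of converters around a common format can be sketched in Python as a small converter table with the common format as a hub: a tool-specific format is converted into the hub, and from the hub into whatever the next tool expects. Everything here is a toy assumption: the hub is a plain list of dictionaries standing in for the Travelling Object, not the real XCES or GrAF schema, and the two tool formats and function names are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Token = Dict[str, str]
HUB = "TO"  # stand-in for the Travelling Object; NOT the real XCES/GrAF schema


def spaces_to_hub(text: str) -> List[Token]:
    """Parse hypothetical 'word lemma tag' lines into hub tokens."""
    tokens = []
    for line in text.splitlines():
        if line.strip():
            word, lemma, tag = line.split()[:3]
            tokens.append({"word": word, "lemma": lemma, "tag": tag})
    return tokens


def hub_to_tsv(tokens: List[Token]) -> str:
    """Serialize hub tokens as hypothetical tab-separated lines."""
    return "\n".join(f"{t['word']}\t{t['lemma']}\t{t['tag']}" for t in tokens)


# One converter into the hub and one out of it per format, instead of one
# converter for every pair of tools.
CONVERTERS: Dict[Tuple[str, str], Callable] = {
    ("toolA", HUB): spaces_to_hub,
    (HUB, "toolB"): hub_to_tsv,
}


def convert(data, src: str, dst: str):
    """Convert directly if a converter exists, otherwise go through the hub."""
    if src == dst:
        return data
    direct = CONVERTERS.get((src, dst))
    if direct is not None:
        return direct(data)
    to_hub = CONVERTERS.get((src, HUB))
    from_hub = CONVERTERS.get((HUB, dst))
    if to_hub is None or from_hub is None:
        raise KeyError(f"no conversion path from {src} to {dst}")
    return from_hub(to_hub(data))


if __name__ == "__main__":
    print(convert("the the DT\ncat cat NN", "toolA", "toolB"))
```

With a hub, adding a new tool means writing at most two converters, which is one way to read why the platform publishes converters to and from TO and GrAF.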

3rd party tools integration

• PANACEA WS wrapper (Soaplab) and the CI make it easy for WS Providers to integrate 3rd party tools.

• ILSP tools are UIMA tools: UIMA
• Freeling: UPC
• Treetagger: University of Stuttgart
• Twitter NLP: Carnegie Mellon University
• MALT Parser: Uppsala University
• DeSR: Università di Pisa
• MOSES / Giza++
• DELiC4MT (MT evaluation): DCU
• Berkeley tagger, parser, aligner: University of California, Berkeley

Web Services Scalability

• Web services are being deployed using Soaplab 2.3.2:
  – Service providers only need to use metadata (ACD) files (Usability)
  – Web client application to test WSs: Spinet (Usability)
  – PANACEA developers have been in contact with Soaplab developers (Collaboration)
  – SOAP protocol standard (Interoperability)
    • WS can be called from Taverna or other workflow editors
    • WS can be called with many programming languages: Python, Perl, Ruby, Java, etc.
  – Soaplab polling to avoid client timeouts (Scalability)
  – PANACEA improvements (Scalability)
    • Parallel request limit system
    • SOAP messaging optimization
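A parallel request limit can be sketched in Python with a bounded semaphore. This is only an illustration of the general technique: the deck does not describe how the PANACEA limit system is implemented, so the class, the exception and the choice to reject (rather than queue) extra requests are all assumptions.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class ServerBusy(Exception):
    """Raised when the service already runs its maximum number of jobs."""


class ParallelRequestLimiter:
    """Allow at most max_parallel jobs of one service to run at the same time."""

    def __init__(self, max_parallel: int) -> None:
        self._slots = threading.BoundedSemaphore(max_parallel)

    def run(self, job: Callable[[], T]) -> T:
        # Non-blocking acquire: an over-limit request is refused immediately,
        # so the client can retry later instead of overloading the machine.
        if not self._slots.acquire(blocking=False):
            raise ServerBusy("parallel request limit reached, retry later")
        try:
            return job()
        finally:
            self._slots.release()


if __name__ == "__main__":
    limiter = ParallelRequestLimiter(max_parallel=3)
    print(limiter.run(lambda: "tagged document"))
```

A refused request pairs naturally with client-side retries: the workflow engine waits and tries again, while the server stays within its memory and CPU budget.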

Workflow design optimization: Robustness

• Building workflows with Taverna
  – Version 2.4.2 (Scalability)
  – Polling (Soaplab) (Scalability)
    • Long-lasting web service calls without timeouts
  – Retries (Scalability)
  – Parallelization (Scalability)
  – Tutorials and videos (Usability)
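The “retries” idea can be sketched in Python as a wrapper that repeats a failing call after a growing delay. This is not Taverna's implementation; the function and parameter names are hypothetical. The defaults for retries, initial delay and maximum delay reuse the values listed in the parallel-request tests of this deck (2, 5000, 150000, assumed to be milliseconds), while the growth factor of 2.0 is purely illustrative.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    retries: int = 2,
    initial_delay_ms: float = 5000,
    max_delay_ms: float = 150000,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on failure wait and try again, up to `retries` extra times."""
    delay_ms = float(initial_delay_ms)
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # no attempts left: report the last error
            sleep(delay_ms / 1000.0)
            # Increase the delay for the next attempt, capped at the maximum.
            delay_ms = min(delay_ms * factor, max_delay_ms)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky() -> str:
        # Fails twice, then succeeds, like a temporarily overloaded service.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("temporary failure")
        return "tagged"

    waits = []
    print(call_with_retries(flaky, sleep=waits.append), waits)
```

With thousands of documents per run, a single transient failure would otherwise abort or corrupt the whole workflow, which is presumably why retries are listed next to polling.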

Machine Resources: handling parallel requests

Parallelization level 3 (3 parallel requests per service * 2 services = 6 concurrent requests)

Workflow name: Freeling_tagging_for_crawled_data_with_output_download
File: massive_freeling_for_crawled_data_v11_download.t2flow
myExperiment URL: http://myexperiment.elda.org/workflows/32
Taverna: 2.4.0 workbench

| VM | Cores | RAM | HD |
|---|---|---|---|
| iula04 (UPF) | 4 | 8 | 40GB (SAS) |

| WS | parall. | poll. int. | poll. backoff | poll. max int. | retries | ini. delay | max | factor |
|---|---|---|---|---|---|---|---|---|
| WS1: python_preprocess + freeling_tagging + python_postprocessing | 3 | 2000 | 1 | 10000 | 2 | 5000 | 150000 | 20 |
| WS2: postagger_to_xces_converter | 3 | 2000 | 1 | 10000 | 2 | 5000 | 150000 | 20 |

| corpus | list file | urls | url example | Tokens |
|---|---|---|---|---|
| MCv2 | LAB_ES_list.sorted.txt | 13188 | http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml | 61 M |

| Name | Status | Queued it. | It. done | It. w/error | Average time/it. |
|---|---|---|---|---|---|
| Freeling_tagging_for_crawled_data_with_output_download | Finished | - | - | - | 5.2 h |
| download_dataUrl | Finished | 0 | 13188 | 0 | 31 ms |
| freeling_tagging | Finished | 0 | 13188 | 5 | 4.2 s |
| postagger_to_xces_converter | Finished | 0 | 13188 | 0 | 4.1 s |

Machine Resources: handling parallel requests

Parallelization level 10 (10 parallel requests per service * 2 services = 20 concurrent requests)

Workflow name: Freeling_tagging_for_crawled_data_with_output_download
File: massive_freeling_for_crawled_data_v11_download.t2flow
myExperiment URL: http://myexperiment.elda.org/workflows/32
Taverna: 2.4.0 workbench

| VM | Cores | RAM | HD |
|---|---|---|---|
| iula04 (UPF) | 4 | 8 | 40GB (SAS) |

| WS | parall. | poll. int. | poll. backoff | poll. max int. | retries | ini. delay | max | factor |
|---|---|---|---|---|---|---|---|---|
| WS1: python_preprocess + freeling_tagging + python_postprocessing | 10 | 2000 | 1 | 10000 | 2 | 5000 | 150000 | 20 |
| WS2: postagger_to_xces_converter | 10 | 2000 | 1 | 10000 | 2 | 5000 | 150000 | 20 |

| corpus | list file | urls | url example | Tokens |
|---|---|---|---|---|
| MCv2 | LAB_ES_list.sorted.txt | 13188 | http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml | 61 M |

| Name | Status | Queued it. | It. done | It. w/error | Average time/it. |
|---|---|---|---|---|---|
| Freeling_tagging_for_crawled_data_with_output_download | Finished | - | - | - | 2.2 h |
| download_dataUrl | Finished | 0 | 13188 | 0 | 29 ms |
| freeling_tagging | Finished | 0 | 13188 | 5 | 5.9 s |
| postagger_to_xces_converter | Finished | 0 | 13188 | 0 | 4.8 s |
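The two runs can be compared with a few lines of Python arithmetic. The sketch assumes that the 5.2 h and 2.2 h figures in the workflow rows are total wall-clock times for the 13188 documents, which the tables suggest but do not state explicitly; the function names are illustrative.

```python
def speedup(slow_hours: float, fast_hours: float) -> float:
    """How many times faster the second run finished."""
    return slow_hours / fast_hours


def docs_per_hour(documents: int, hours: float) -> float:
    """Average throughput of a run."""
    return documents / hours


def parallel_efficiency(observed_speedup: float, slow_level: int, fast_level: int) -> float:
    """Observed speedup divided by the ideal speedup (ratio of parallel levels)."""
    return observed_speedup / (fast_level / slow_level)


DOCUMENTS = 13188
TOTAL_HOURS = {3: 5.2, 10: 2.2}  # parallelization level -> assumed total time

if __name__ == "__main__":
    s = speedup(TOTAL_HOURS[3], TOTAL_HOURS[10])
    print(f"speedup: {s:.2f}x")
    for level, hours in TOTAL_HOURS.items():
        print(f"level {level}: {docs_per_hour(DOCUMENTS, hours):.0f} documents/hour")
    print(f"efficiency: {parallel_efficiency(s, 3, 10):.2f}")
```

Under that assumption, raising the level from 3 to 10 gives a speedup of about 2.4 rather than the ideal 3.3, which is consistent with the per-item time of freeling_tagging rising from 4.2 s to 5.9 s on the same 4-core machine.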

Machine Resources: handling parallel requests

• From 1x to 10x experiment
  http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_parallelization_scalability_v01.mp4
  – Two Taverna instances running at the same time
  – 100 documents to be processed
  – One workflow with NO parallelization, the other with 10x
  – The same server: ws04 with 8GB RAM and 4 CPUs
• More resources > more parallel requests

Machine Resources: handling parallel requests

• Conclusions:
  – PANACEA fulfils the large data scalability goal (Scalability)
  – Requirements:
    • Robust WS deployment: Soaplab (with PANACEA improvements) or other robust frameworks
    • Taverna 2.4
    • Workflow design must follow the PANACEA massive data tutorial (retries, polling, etc.)
• The architecture is highly scalable: growth is just a matter of resources

Statistics

Typical PANACEA server:
• 2 - 4 cores
• 4 - 8 GB RAM
• 30 - 100 GB HDD
• 100 Freeling WS parallel requests

EMBL-EBI (European Bioinformatics Institute in Cambridge):
• 200 servers
• 2000 cores
• Server request balancing software, etc.

More than 50000 Freeling WS parallel requests
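One way to read the “more than 50000” figure is a linear extrapolation by cores, sketched below in Python. This is an assumption, not something the slide states: it supposes that capacity grows proportionally with the number of cores and that a typical server sits at the upper end of the 2 - 4 core range. The function name is illustrative.

```python
def estimated_parallel_requests(
    total_cores: int, cores_per_server: int, requests_per_server: int
) -> int:
    """Scale a per-server capacity linearly to a larger pool of cores."""
    equivalent_servers = total_cores // cores_per_server
    return equivalent_servers * requests_per_server


if __name__ == "__main__":
    # 2000 cores, with 100 parallel Freeling requests per 4-core server.
    print(estimated_parallel_requests(2000, 4, 100))
    # The same pool if 100 requests only needed a 2-core server.
    print(estimated_parallel_requests(2000, 2, 100))
```

Real capacity would also depend on RAM, I/O and the request balancing software mentioned on the slide, so the estimate is only an order of magnitude.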


PANACEA WP7

Final evaluation cycle

Platform Validation

Partners:

CNR, ELDA, LG, UPF, UCAM

Review 2013-02-19, Luxembourg

Valeria Quochi CNR ([email protected])


Platform v3 validation

• Objective
  – Validation of the integration of components: test functionality of middleware
• Criteria
  – Final technical, functional and quality requirements: defined in D7.1 (sec 2.2)
• Methodology (like previous cycles)
  – Definition of validation scenarios organised in tasks for testing specific requirements
  – 2 validator “types”: platform users (3), service providers (2)
    • 2 “external” & 1 internal platform users, 2 “internal” service providers
  – Training phase (1 week)
  – Testing phase through tasks (+ material provided) and questionnaires for reporting results

Validation Scenarios

5 scenarios for testing:
- The registry: service availability, annotation and monitoring (platform user validator)
- Web Services: accessibility, response times, … (platform user validator)
- Interoperability: service combination into workflows (platform user validator)
- Interoperability: service design/integration (common interfaces, i/o formats) (service provider validator)
- Security: security of web services and platform (service provider validator)

Criteria v3

| Criteria | Scenario(s) |
|---|---|
| Req-TEC-0004 – Annotating services | A |
| Req-TEC-0005 – Web service monitoring | A |
| Req-TEC-0105 – (1st cycle) Metadata description | A |
| * Req-TEC-0101c – Components accessibility – 3 | B |
| Req-TEC-0102 – Components time response | B |
| * Req-TEC-0103 – Components time slot | B |
| ● Req-TEC-0208 – Checking of matches among components | C |
| * Req-TEC-0304c – Common Interfaces design – 3 | D |
| Req-TEC-0305 – Adding of new components | D |
| ● Req-TEC-1103 – Privacy | E |
| Req-TEC-1104 – WS Authentication | E |
| Req-TEC-1203 – Versioning | Checked apart |

Validation results

From validation reports:
- 53 Fulfilled
- 5 Partially Fulfilled
- 13 Not Fulfilled (validators' lack of experience/expertise)

The platform is functional and operative. Clear progress; important requirements and main expectations are fulfilled.

Problems in validation were mostly due to web services, not to middleware.

- Critical issues:
  - Metadata and annotation of services and workflows
  - Lack of a web client for non-Soaplab services reduced access
  - Need for converters to adapt to TOs

Lessons learnt / recommendations for SPs

- Spinet made a difference: web clients/GUIs for direct usage/testing of services
- Less documentation & annotation = more difficult usability: invest more in service documentation, i/o examples, …
- Existing WFs helped in creating new ones: SPs should make and share workflows with their services
- Interoperability can be improved: more converters, create workflows for conversions

Conclusions

• Functional platform
  – Web services software
  – Registry / myExperiment
• Usability for users and providers
• Interoperability:
  – Data formats
  – Common Interfaces
• Tutorials and Documentation
• Scalability

The future

• Authentication Web Services (Business opportunity)
  – Institutions and companies can sell their services and/or machine resources
• Automatically build workflows (Usability and interoperability)
  – Based on input data and user desired output, etc.
• Data visualization tools / widgets (Usability)
• Improve total throughput (Scalability)
  – With more machine resources we can achieve faster experiment results
  – Software optimization: task splitting and parallelization
• Publications with experiments (Research)
  – Researchers could link their publications to real experiments (WS, workflows, data, etc.)
  – Fostering research by making experiments easily replicable
  – Improved experiments: more data, more machine resources, faster results, etc.

Thank you

Questions?
